ABSTRACT



An improved information retrieval system for
retrieving information from a plurality of sources and for 
storing information source descriptions in a knowledge base. 
The user interface includes a hypertext browser and a 
knowledge base browser/editor. The hypertext browser 
allows a user to browse an unstructured information space 
through the use of interactive hypertext links. The knowl- 
edge base browser/editor displays a directed graph repre- 
senting a generalization taxonomy of the knowledge base, 
with the nodes representing concepts and edges representing 
relationships between concepts. The system allows users to 
store information source descriptions in the knowledge base 
via graphical pointing means. By dragging an iconic repre- 
sentation of an information source from the hypertext 
browser to a node in the directed graph, the system will store 
an information source description object in the knowledge 
base. The knowledge base browser/editor is also used to 
browse the information source descriptions previously 
stored in the knowledge base. The result of such browsing is
an interactive list of information source descriptions which 
may be used to retrieve documents into the hypertext 
browser. The system also allows for querying a structured 
information source and using query results to focus the 
hypertext browser on the most relevant unstructured data 
sources. 

20 Claims, 8 Drawing Sheets 
FIG. 3



Algorithm GenerateSubPlan(E(W), C(W), SD)

E(W) is a query on a single world-view relation, C(W) is a constraint on the
tuples that need to be computed, and SD is the collection of site descriptions.
The output is a collection of sub-plans, one for each of the relevant site
descriptions in SD.

The following steps are performed for each site description SD in the
collection SD.

1. If SD is of the form (1) or (2), i.e.,

      C_R(Y), R(Y) ⊆ C_E(W), E(W)

      C_R(Y), R(Y) = C_E(W), E(W)

   and C(W) ∧ C_E(W) is satisfiable, generate a sub-plan for answering a
   fragment of E using traditional query optimization techniques on the
   conjunctive query:

      π_W(σ_{C_R(Y) ∧ C(W)}(R(Y)))

2. If SD is of the form (3) or (4), i.e.,

      C_R(X), R(X) ⊆ C_E(Y), E_1(X_1), ..., E(W), ..., E_k(X_k)

      C_R(X), R(X) = C_E(Y), E_1(X_1), ..., E(W), ..., E_k(X_k)

   C_E(Y) ∧ C(W) is satisfiable, and X (the variables of the site relation R)
   contain the variables of W, generate a sub-plan for answering a fragment
   of E using traditional query optimization techniques on the conjunctive
   query:

      π_W(σ_{C_R(X) ∧ C(W)}(R(X)))

3. In the case when E is a unary concept relation, we perform the first two
   steps for concept relations E' that are subconcepts of E.
FIG. 4



Algorithm DynamicPlanEval(Q(X), SD)

Q(X) is the query, and SD is the collection of site descriptions.

1. Determine an order E_1(X_1), ..., E_k(X_k) of joining the conjuncts in Q(X).
   Let P_i, 0 ≤ i ≤ k, denote a set of pairs of the form (t, C(Y)), where t
   is a tuple in the join of relations E_1, ..., E_i, and C(Y) is a constraint,
   computed as described below. P_0 is defined to have a single pair, whose
   tuple component has the empty tuple and whose constraint component
   has C_Q, the query constraints.

2. Perform the following steps for i = 1 to k.

   (a) For each tuple (t, C(Y)) ∈ P_{i-1} do

       i. Let C_i(X_i) denote the projection of C(Y) on the variables in X_i.

       ii. Generate a sub-plan for computing the tuples in the relation
           E_i(X_i) satisfying the constraint C_i(X_i), using the site
           descriptions SD.

       iii. Let t_i be a tuple computed for E_i using a site description SD.
            Let C'_i(X_i) denote the projection of C_R ∧ C_E on the variables
            X_i, where C_R and C_E are the constraints on the two sides of
            the site description SD.
            For each tuple t_i in E_i and matching C'_i(X_i), add the pair
            (t ∘ t_i, C(Y) ∧ C'_i(X_i)) to P_i, where ∘ denotes concatenation
            of tuples.

3. The answers to the query can be computed from P_k by taking each tuple
   in the tuple component and projecting it on the variables of Q(X).

Reference numerals appearing in the figure: 401, 405 (step 2(a)ii), 407,
409 (step 2(a)iii).
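
The steps of FIG. 4 may be easier to follow with a small executable sketch. The Python below is only an illustration under stated assumptions, not the implementation of the preferred embodiment: each site is modelled as an in-memory list of rows paired with the constraints of its site description, an atomic constraint is a hypothetical triple (variable, operator, constant), and sub-plan generation is reduced to scanning the rows. The relation names are borrowed from the airline example used later in the description; the function and variable names are invented for the sketch.

```python
import operator

# Comparison operators allowed in atomic constraints (variable, op, constant).
OPS = {"<": operator.lt, "<=": operator.le, ">": operator.gt,
       ">=": operator.ge, "=": operator.eq, "!=": operator.ne}


def project(constraints, variables):
    """Keep only the atomic constraints that mention one of the given variables."""
    return [c for c in constraints if c[0] in variables]


def holds(constraints, binding):
    """True if the binding satisfies every atomic constraint in the list."""
    return all(OPS[op](binding[var], const) for var, op, const in constraints)


def dynamic_plan_eval(conjuncts, query_constraints, sources, answer_vars):
    """conjuncts: ordered list of (relation name, variable names)  -- step 1.
    sources: relation name -> list of (site constraints, rows)     -- assumed layout.
    """
    partial = [({}, list(query_constraints))]            # P_0: empty tuple, C_Q
    for name, variables in conjuncts:                     # step 2: i = 1..k
        next_partial = []
        for binding, constraints in partial:              # each pair in P_{i-1}
            wanted = project(constraints, variables)      # step i: C_i(X_i)
            for site_constraints, rows in sources[name]:  # step ii (simplified)
                for row in rows:
                    candidate = dict(zip(variables, row))
                    # Join condition: shared variables must agree with t.
                    if any(binding.get(v, val) != val
                           for v, val in candidate.items()):
                        continue
                    if not holds(wanted, candidate):
                        continue
                    # Step iii: concatenate tuples, conjoin the site constraints.
                    next_partial.append(
                        ({**binding, **candidate},
                         constraints + project(site_constraints, variables)))
        partial = next_partial
    # Step 3: project each tuple on the variables of the query.
    return sorted({tuple(b[v] for v in answer_vars) for b, _ in partial})


# Hypothetical data: one quote site and two directory sites split by area code.
sources = {
    "Quote": [([("c", "<", 1000)],
               [("Ace", "UA", "NYC", "SFO", 450, "1995-03-01"),
                ("Best", "AA", "NYC", "SFO", 700, "1995-03-01")])],
    "Dir": [([("ac", "=", 212)], [("Ace", 212, "555-0100")]),
            ([("ac", "=", 908)], [("Best", 908, "555-0199")])],
}
conjuncts = [("Quote", ("ag", "al", "src", "dest", "c", "d")),
             ("Dir", ("ag", "ac", "telNo"))]
answers = dynamic_plan_eval(conjuncts, [("c", "<", 500)], sources,
                            ("ag", "telNo"))
```

The sketch carries the constraint component along with each partial tuple, which is the feature that lets later conjuncts be restricted by what earlier sites returned. It does not skip sites whose descriptions contradict the projected constraint; a fuller version would do so before scanning.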
FIG. 5
USER INTERFACE FOR INFORMATION RETRIEVAL SYSTEM

RELATED APPLICATIONS

The present application is a continuation-in-part of U.S. patent
application Ser. No. 08/347,016, filed Nov. 30, 1994, and now U.S. Pat.
No. 5,600,831, which is a continuation-in-part of U.S. patent application
Ser. No. 08/203,082, filed Feb. 28, 1994.

FIELD OF THE INVENTION

This invention relates to information retrieval generally. More
particularly, the invention relates to an improved information retrieval
system user interface.

BACKGROUND OF THE INVENTION

The Internet is a global computer network providing access to a large,
distributed body of information. The collection of information accessible
throughout this network is generally not organized or indexed, making the
task of locating useful information difficult. The difficulty of finding
and retrieving information is exacerbated by the multiplicity of protocols
used for interacting with information and service providers, numerous
formats for different types of multimedia data, and the rapidly growing
and changing topology of the network.

The World Wide Web (WWW) is an initiative to simplify navigation of this
sea of information. The WWW encompasses a family of Internet protocols and
a hypertext data model to enable more convenient access to multimedia
data. Hypertext links, which are embedded in the hypertext documents,
express relationships among pieces of information, as well as location,
format and access method for retrieving the data designated by the link.
Software interfaces to the WWW present this data to users in such a way
that retrieval of data is performed by simple operations on these
hypertext links. These interfaces ease the task of navigation, retrieval,
and presentation of information by hiding details of access.

The hypertext model, while simple and convenient to use, does not
contribute to creating rational organizations of information. On the
contrary, the relationships implied by the links are arbitrary, so the
interconnected body of information within the WWW is still mostly
unstructured and disorganized. The result is that information retrieval on
the WWW is still a laborious and time-consuming process.

One way that existing software interfaces to the WWW (called WWW clients)
help with this process is to provide a way to keep track of interesting
information sources, by allowing users to save links so that the process
of locating the information source does not have to be repeated for future
access to the information. In particular, many users find useful
information sources that they want to be able to return to easily. The
current state of the art of WWW clients allows these links to be recorded
in lists. Such lists provide an alternative way to navigate the WWW,
allowing direct access to a previously accessed information source. Such a
mechanism has proven to be practically essential for effective WWW
navigation.

The weakness of this approach is that these lists quickly become
unmanageable as they grow in size. Finding previously stored information
in a large list can be difficult. Similarly, the lack of the ability to
view an overall organization of the information reduces the effectiveness
of such lists. In addition, these lists retain minimal information about
information sources, typically just a Universal Resource Locator (URL),
which can be thought of as an information source address in the WWW, and
some text that may or may not accurately describe the contents of the
information sources.

The present invention solves the shortcomings of the prior art by
providing an improved information retrieval system user interface.

SUMMARY OF THE INVENTION

The present invention provides an improved user interface for an
information retrieval system. In the preferred embodiment the information
retrieval system retrieves information from a plurality of information
sources and stores information source descriptions in a knowledge base.
These information source descriptions contain various attributes which
describe the information source.

The interface includes a hypertext browser coupled with a knowledge base
browser/editor. The hypertext browser is used to browse an information
space, such as the World Wide Web. The knowledge base browser/editor
displays a directed graph which represents a generalization taxonomy of
the concepts in the knowledge base. When an information source (such as a
document) of interest is retrieved, the user may store an information
source description in the knowledge base via the graphical user interface.
For example, by pointing to an icon in the document of interest and
dragging the icon into the knowledge base browser/editor, the system will
store an information source description object in the knowledge base. The
system will automatically extract certain information source description
attributes from the document. The user may specify a particular knowledge
base concept that the information source description is to be an instance
of by dragging the icon to a particular node in the directed graph. The
system also provides means for textually editing the information source
description attributes prior to adding the information source description
as a knowledge base object.

The knowledge base browser/editor is also used to browse the knowledge
base. If a user points to a node in the directed graph, the system
displays a list of information source description objects which are stored
as instances of the concept related to that node. This list is interactive
in that the user may point to one of the displayed objects and the
document related to the object will be retrieved and displayed in the
hypertext browser. The system also allows for a user to perform more
complex queries on the knowledge base by entering a textual query.

The information space browsed by the hypertext browser will typically
contain unstructured data sources. These data sources are appropriate for
browsing in that there is no defined structure to the information. In
accordance with another aspect of the invention, a structured database
query may be used to provide a user with information from an unstructured
data source. A user makes a request for information to the system as a
query. The system responds to the query by retrieving as much information
as possible from the structured data sources. This information is then
used to prune the set of unstructured data sources to identify a subset of
such sources. The hypertext browser then browses this subset of
unstructured data sources. In this manner, the user is focused on the
unstructured information sources which are most relevant to the request
for information.

These and other advantages of the invention will be apparent to those of
ordinary skill in the art by reference to the following detailed
description and the accompanying drawings.
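
The storing and browsing behaviour summarized above can be pictured with a short sketch. The Python below is a minimal illustration only, assuming a plain tree in place of the directed graph and an in-memory object in place of the CLASSIC knowledge base; the class names, the drop_on_node helper and the example.org address are all hypothetical and do not come from the specification.

```python
from dataclasses import dataclass, field


@dataclass
class SourceDescription:
    """Attributes describing one information source (names are illustrative)."""
    url: str
    title: str
    attributes: dict = field(default_factory=dict)


@dataclass
class ConceptNode:
    """One node of the taxonomy; children are more specific concepts."""
    name: str
    children: list = field(default_factory=list)
    instances: list = field(default_factory=list)

    def add_child(self, name):
        child = ConceptNode(name)
        self.children.append(child)
        return child

    def find(self, name):
        """Depth-first search for the node with the given concept name."""
        if self.name == name:
            return self
        for child in self.children:
            hit = child.find(name)
            if hit is not None:
                return hit
        return None


def drop_on_node(root, concept_name, description):
    """Stand-in for dragging an icon onto a node: store the description there."""
    node = root.find(concept_name)
    if node is None:
        raise KeyError(concept_name)
    node.instances.append(description)
    return node


def browse(node):
    """Stand-in for pointing at a node: list the sources stored directly on it."""
    return [d.url for d in node.instances]


# Hypothetical taxonomy and one stored description.
root = ConceptNode("Thing")
travel = root.add_child("Travel")
airlines = travel.add_child("Airlines")
drop_on_node(root, "Airlines",
             SourceDescription("http://example.org/fares", "Fare listings"))
```

Here browse returns only the descriptions stored directly on the selected node. Whether instances of more specific concepts should also be listed is a design choice this sketch leaves open.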
BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a conceptual overview of the information retrieval system
described in the parent of the present application;

FIG. 2 is a detail of a site description in a preferred embodiment
described in the parent of the present application;

FIG. 3 shows the algorithm employed in the preferred embodiment described
in the parent of the present application to generate query subplans;

FIG. 4 shows the algorithm employed in the preferred embodiment described
in the parent of the present application for dynamically generating a
query plan;

FIG. 5 is a detailed block diagram of access plan generation and execution
component 119 of information retrieval system 101 in the preferred
embodiment described in the parent of the present application;

FIG. 6 shows a first screen display of a preferred embodiment of the user
interface in accordance with the present invention;

FIG. 7 shows a second screen display of a preferred embodiment of the user
interface in accordance with the present invention; and

FIG. 8 shows a display of the path history browser of a preferred
embodiment of the user interface in accordance with the present invention.

DETAILED DESCRIPTION

The following detailed description contains material from the detailed
description of the parent of the present application, U.S. patent
application Ser. No. 08/347,016, through the section entitled Obtaining
Domain and Source Information and including FIGS. 1-5. The new material
which is being added in the present application begins at the section
entitled Improved User Interface and includes FIGS. 6-8.

Architecture

Architecture Overview

FIG. 1 presents an overview of an information retrieval apparatus 101
which incorporates the principles of the invention. A preferred embodiment
of information retrieval apparatus is implemented using a digital computer
system and information sources which are accessible via the Internet
communications network.

The central component of apparatus 101 is a knowledge base 109 built upon
a description logic based knowledge representation system (CLASSIC in the
preferred embodiment) which is capable of performing inferences of
classification, subsumption, and completion. Knowledge-base systems are
described generally in Jeffrey D. Ullman, Principles of Database and
Knowledge-base Systems, Vols. I-II, Computer Science Press, Rockville,
Md., 1989. Descriptions of CLASSIC may be found in Alex Borgida, Ronald
Brachman, Deborah McGuinness, and Lori Resnick, "CLASSIC: A Structural
Data Model for Objects", in Proceedings of the 1989 ACM SIGMOD
International Conference on Management of Data, pp. 59-67, 1989; R. J.
Brachman, et al., "Living with CLASSIC", in J. Sowa, ed., Principles of
Semantic Networks: Explorations in the Representation of Knowledge,
Morgan-Kaufmann, 1991, pp. 401-456; and L. A. Resnick, et al., CLASSIC:
The CLASSIC User's Manual, AT&T Bell Laboratories Technical Report, 1991.

Knowledge base 109 is used to construct a domain model 111 which organizes
information accessible via apparatus 101 into a set of concepts which fit
the manner in which the user of system 101 is intending to view and use
the information. Domain model 111 has three components: world view 115,
which contains concepts corresponding to the way in which a user of the
system looks at the information being retrieved; system/network view 117,
which contains concepts corresponding to the way in which the information
is described in the context of the data bases which contain it and the
communications protocols through which it is accessed; and information
source descriptions 113, which contains concepts describing the
information sources at a conceptual level. System/network view 117 and
information source descriptions 113 are normally not visible to the user.
The concepts in these portions of domain model 111 do, however,
participate fully in the reasoning processes that determine how to satisfy
a query.

An important benefit of using a description logic system like CLASSIC is
that as new information is added to the system, much of the work of
organizing the new information with respect to the concepts already in
knowledge base 109 is done automatically. Only a description of the known
attributes of the information must be specified; CLASSIC's inference
mechanisms then automatically classify these descriptions into appropriate
places in the concept hierarchy.

User interaction with the system is accomplished through browsing and
querying operations in terms of high-level concepts (concepts that are
meaningful to a user unsophisticated in the details of information
location and access). These concepts are intended to reflect the terms in
which the user thinks about the type and content of information being
queried. By working with these high-level concepts, the user is unburdened
with the details of the location and distribution of information across
multiple remote information servers.

Information sources 123 are generally (though not limited to)
network-based information servers that are accessed by standard internet
communication protocols. Sources can also include databases, ordinary
files and directories, and other knowledge bases.

Query Translator 107

The query language used in system 101 is based on CLASSIC, but has
additional constructors that enable the user to express queries more
easily. The query is formulated in terms of the concepts and objects that
appear in the world view part 115 of the knowledge base. Query translator
107 translates queries expressed in the query language into CLASSIC
description language expressions which are used to consult the knowledge
base. Due to the limited expressive power of the description language and
the need for special purpose query operators, the query language may
contain elements not expressible in the description language of knowledge
representation system 109. After partial translation to a description
language expression, the remaining fragments of the query are translated
to procedural code that is executed as part of the query evaluation.

Knowledge Representation System 109

The knowledge base is a virtual information store in the sense that the
information artifacts themselves remain external to the knowledge base;
the system instead stores detailed information (in terms of domain model
111) about the location of these information artifacts and how to retrieve
them. Retrieval of a particular piece of information is done on demand
when it is needed to satisfy part of a query. The types of information
managed in this manner include files, directories, indexes, databases,
etc.

The domain model embodied in the knowledge base is
logically decomposed into world view 115, system/network
view 117, and information source descriptions 113. World view 115 is the
set of concepts with which the user interacts and in which queries are
expressed. System/network view 117 concerns low level details which,
though essential for generating successful query results, are normally of
no interest to the user. Information source descriptions 113 is a
collection of concepts for describing information sources. These
information source descriptions are expressed in terms of both world and
system concepts. The purpose of encoding information source descriptions
113 in the domain model is to make it possible for CLASSIC to reason about
what information sources must be consulted in order to satisfy a query.

We define system concepts comprising system/network view 117 as those
concepts that describe the low-level details of information access. This
includes concepts related to network communication protocols, location
addressing, storage formats, index types, network topology and
connectivity, etc. Since the knowledge base generally merely retrieves
information instead of storing previously-retrieved information,
system/network view 117 includes all those concepts relevant to
determining attributes like location, retrieval methods, and content
format.

Continuing in more detail, concepts within world view 115 describe things
with which the user is familiar; they are the concepts that describe
characteristics of information artifacts of interest to users. Concepts
within information source descriptions 113 relate the concepts in world
view 115 to concepts concerning the semantic content of information
sources. Thus, given a query which employs concepts in world view 115,
knowledge representation system 109 can employ the concepts in information
source descriptions 113 to relate the concepts used in the query to actual
information sources and can employ system/network view 117 to relate the
concepts used in the query to an access plan which describes how to
retrieve information from the sources as required to answer the query.

Access Plan Generation and Execution

When a user wishes to obtain information, the user inputs a query in
system 101's query language at graphical user interface 103. System 101
then answers the query. There are several steps involved. First, query
translator 107 translates the query into a form to which knowledge
representation system 109 can respond. Then the translated query is
analyzed in knowledge base system 109 to decide which of the external
information sources are relevant to the query, and which subqueries need
to be sent to each information source. This step uses world view 115 and
system/network view 117. The information in system/network view 117 is
expressed in a site description language which will be described in more
detail later.

Knowledge base 109 uses the conceptual information from world view 115 and
system/network view 117 to produce an information access description
describing how to access the information required for the query in
information sources 123. Knowledge base 109 provides the information
access description to access plan generation and execution component 119,
which formulates an access plan including the actual commands needed to
retrieve the information from sources 123.

1. Plan formulation: Given the information access description, planner 119
   decides on the order in which to access sources 123 and how the partial
   answers will be combined in order to answer the user's query. The key
   distinction between this step and traditional database techniques is
   that planner 119 can change the plan after partial answers are
   obtained. Replanning may of course involve inferences based on concepts
   from information source descriptions 113 and/or system/network view 117
   and the results of the search thus far.

2. Plan materialization: The previous step produced a plan at the level of
   logical source accesses. This step takes these logical accesses and
   translates them to specific network commands. This phase has two
   aspects:

   Format translation: the description of the sites is given at a logical
   level. However, to actually access the site, one must conform to a
   syntax of a specific query language. In this step, these translations
   are done.

   Specific network commands are generated to access the information.
   Information from the system/network view is taken into account.
   Depending on how the site is being accessed, the system will generate
   the appropriate commands for performing the access.

3. Plan execution: The commands are performed by Information Access
   Protocol Modules 121 [...].

Several points should be noted about the process:

   In executing the plan, system 101 uses a work space in the computer
   system upon which system 101 is implemented to store [...].

   After executing part of the plan, system 101 may decide to replan for
   the rest of the query.

Information Access Protocol Modules 121

Access to information is done using a variety of information access
protocols. The purpose of these modules is to translate generic
information access operations (retrieval, listing collections, searching
indexes) into corresponding operations of the form expected by the
information source. For many standard Internet access protocols, the
translation is straightforward.

Examples of access protocols supported by these modules include network
protocols defined by Internet draft standard documents, including FTP
(File Transfer Protocol), Gopher, NNTP (Network News Transfer Protocol),
and HTTP (Hypertext Transfer Protocol). In addition, other modules support
access to local (as opposed to network-based) information repositories,
such as local filesystems and databases.

Site Description Language

As previously pointed out, the concepts in information source descriptions
113 relate concepts in world view 115 to information sources 123. These
relationships are expressed using a site description language. CLASSIC and
related knowledge representation systems employ description languages
which can function as site description languages, but these description
languages do not permit efficient reasoning. In a preferred embodiment,
efficiency has been substantially increased by the use of a site
description language which extends CLASSIC.

The following discussion of the site description language employed in the
preferred embodiment employs the example below:

Consider an application in which we can obtain information about airline
flights from various travel agents. We have access to fares given by
specific travel agents and to telephone directory information to obtain
their phone numbers. In practice, the information about price quotes and
telephone listings may be distributed across different external database
servers which contain different portions of the information. For example,
some travel agent may deal only with domestic travel; another may deal
with certain airlines. Some travel brokers deal only with last minute
reservations, e.g., flights originating in the next one week. Similarly,
directory information may be distributed by area code. In some area codes,
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all listings may be in one database, while others may 
partition residential and business customers. 

The starting point for the site description language is the
description language used in CLASSIC. A description language
consists of three types of entities: concepts
(representing unary relations), roles (binary relations) and
individuals (object constants). Concepts can be defined in
terms of descriptions that specify the properties that
individuals must satisfy to belong to the concept. Binary
relationships between objects are referred to as roles and are
used to construct complex descriptions for defining concepts.
Description logics vary by the type of constructors
available in the language used to construct descriptions.
Description logics are very convenient for representing and
reasoning in domains with rich hierarchical structure.
Description languages other than the one used in CLASSIC
exist and may be used as starting points for site description
languages. The only requirement is that the question of
subsumption (i.e., does a description D_1 always contain a
description D_2) be decidable. We denote the concepts in our
representation language by 𝒟 = D_1, ..., D_l.

In our example, we can have a hierarchy of concepts
describing various types of telephone customers. The concept
customer is a primitive concept that includes all customers
and specifically the disjoint subconcepts Business
and Residential. Each instance of a business customer has a
role BusinessType, specifying the types of business it
performs. Given these primitive concepts, we can define a
concept TravelAgent by the description

(AND Business (fills BusinessType "Travel")).
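
As a rough illustration of what such a description means, the Python sketch below treats a description as a membership test over individuals. It is not CLASSIC syntax and it performs no subsumption reasoning; the Individual class and the helper names primitive, fills and and_ are assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Individual:
    """An object constant with asserted primitive concepts and role fillers."""
    name: str
    concepts: set = field(default_factory=set)
    roles: dict = field(default_factory=dict)   # role name -> set of fillers


def primitive(concept_name):
    """Description satisfied by individuals asserted to belong to the concept."""
    return lambda ind: concept_name in ind.concepts


def fills(role, value):
    """Description satisfied when the given role has the given filler."""
    return lambda ind: value in ind.roles.get(role, set())


def and_(*descriptions):
    """Conjunction of descriptions, mirroring the AND constructor."""
    return lambda ind: all(d(ind) for d in descriptions)


# (AND Business (fills BusinessType "Travel"))
travel_agent = and_(primitive("Business"), fills("BusinessType", "Travel"))

# Hypothetical individuals.
acme = Individual("Acme Travel", {"Customer", "Business"},
                  {"BusinessType": {"Travel"}})
smith = Individual("J. Smith", {"Customer", "Residential"})
```

With these definitions travel_agent(acme) holds and travel_agent(smith) does not. A description logic system would go further and decide, from the definitions alone, that TravelAgent is subsumed by Business.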

One limitation of description languages is that they do not
naturally model general n-ary relations. (A relation may be
thought of as a table with columns and rows. An n-ary
relation has n columns.) n-ary relations arise very commonly
in practice and dealing with such relations is essential to
modeling external information sources that contain arbitrary
relational databases. Hence our representation language
augments description languages with a set of general n-ary
relations ℰ = E_1, ..., E_n. It should be emphasized that the
general n-ary relations are not part of the description
language. Hereafter, we refer to the set of relations ℰ ∪ 𝒟 as the
knowledge base relations, to distinguish them from relations
stored outside knowledge representation system 109. Our
application domain is naturally conceptualized by the
following two relations:

Quote(ag, al, src, dest, c, d) denotes that a travel agent ag
quoted a price of c to travel from src to dest on airline
al on date d.

Dir(cust, ac, telNo) gives the directory listing of
customer cust as area code ac and phone number telNo.
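
The two relations can be written down as typed tuples to make their columns explicit. The Python below is a sketch with made-up rows; only the relation names and argument names come from the text, and the agents_with_phone helper is hypothetical.

```python
from typing import NamedTuple


class Quote(NamedTuple):
    """Quote(ag, al, src, dest, c, d): a 6-ary relation."""
    ag: str      # travel agent
    al: str      # airline
    src: str     # origin
    dest: str    # destination
    c: float     # quoted price
    d: str       # travel date


class Dir(NamedTuple):
    """Dir(cust, ac, telNo): a 3-ary relation."""
    cust: str    # customer
    ac: int      # area code
    telNo: str   # phone number


def agents_with_phone(quotes, listings, src, dest):
    """Join Quote and Dir on agent = customer for one route."""
    phone = {row.cust: (row.ac, row.telNo) for row in listings}
    return [(q.ag, q.c) + phone[q.ag]
            for q in quotes
            if q.src == src and q.dest == dest and q.ag in phone]


# Made-up extension of the two relations.
quotes = [Quote("Ace", "UA", "NYC", "SFO", 450.0, "1995-03-01"),
          Quote("Best", "AA", "NYC", "BOS", 120.0, "1995-03-02")]
listings = [Dir("Ace", 212, "555-0100"), Dir("Best", 908, "555-0199")]
matches = agents_with_phone(quotes, listings, "NYC", "SFO")
```

In the system described here the rows would not sit in one place as they do in this sketch; they would be spread over external sites, which is the problem the site descriptions address.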

A key aspect of our representation language is the ability
to capture rich semantic structure using constraints, with
which CLASSIC can reason efficiently. An atomic constraint
is an atom either of the form D(x), where D is some concept
in 𝒟 and x is a variable, or (x_i θ x_j) (or (x_i θ a)), where x_i and
x_j are variables, a is a constant and θ ∈ {>, ≥, <, ≤, =, ≠}.
Arbitrary constraints are formed from atomic constraints
using the logical operators ∧ and ∨. CLASSIC can determine
efficiently whether one class subsumes another using
subsumption reasoning in the description logic. Other
well-known techniques are used for implication reasoning of
order constraints. For details, see the Ullman reference cited
above. Any atomic constraint may be used about which
implication/subsumption reasoning can be done efficiently.
Constraints play a major role in information gathering and
are used in several ways. First, semantic knowledge about
the general n-ary relations ℰ can be expressed by constraints
over the arguments of the relations. In our example, we can
specify that the first argument of the relation Quote must be
an instance of the concept TravelAgent. Second, as we
discuss in subsequent sections, constraints can be used to
specify subsets of information that exist at external sites. For
example, a travel agent may have only flights whose cost is
less than $1000. Finally, as we see below, constraints are
extremely useful in specifying complex queries.
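
To show how such a constraint can rule a site in or out, the following Python sketch tests whether a conjunction of order constraints of the form (x θ a) is satisfiable by intersecting intervals. It is a simplified stand-in for the reasoning techniques cited above, assuming numeric variables over a dense domain; it handles only the operators <, <=, >, >= and =, and the triple representation is an assumption of the sketch.

```python
import math


def satisfiable(constraints):
    """constraints: iterable of (variable, op, constant) read as a conjunction.

    Returns True if some assignment of real numbers satisfies all of them.
    """
    bounds = {}  # variable -> (low, low_is_strict, high, high_is_strict)
    for var, op, const in constraints:
        if op not in ("<", "<=", ">", ">=", "="):
            raise ValueError(f"unsupported operator: {op}")
        low, low_strict, high, high_strict = bounds.get(
            var, (-math.inf, False, math.inf, False))
        if op in (">", ">=", "="):
            strict = op == ">"
            # Tighten the lower bound if the new one is stronger.
            if const > low or (const == low and strict):
                low, low_strict = const, strict
        if op in ("<", "<=", "="):
            strict = op == "<"
            # Tighten the upper bound if the new one is stronger.
            if const < high or (const == high and strict):
                high, high_strict = const, strict
        bounds[var] = (low, low_strict, high, high_strict)
    # An interval is non-empty if low < high, or low == high with both ends closed.
    return all(low < high or (low == high and not (ls or hs))
               for low, ls, high, hs in bounds.values())


# Site description: this agent only has flights costing less than $1000.
site = [("c", "<", 1000)]
cheap_query = [("c", "<=", 800)]      # overlaps the site's contents
pricey_query = [("c", ">", 1200)]     # cannot be answered by this site
```

A site whose constraint is unsatisfiable together with the query constraint, as with pricey_query here, need not be contacted at all.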
Constraints may be used together with concepts and
knowledge base relations to describe properties of extensions
of the knowledge base relations, that is, information
specified by the knowledge base relations and the properties.
The information in the extension may come from the knowledge
base, but most often it will come from one or more of
the information sources 123. We assume that the definitions
of the concepts exist in the knowledge base, although the
extensions of the concepts and the relations may not be
entirely present in the knowledge base. However, we assume
that constraints contain only concepts whose extensions
exist in the knowledge base.

Given a query (defined formally below), the knowledge 
base system must infer the missing portions of the exten- 
sions of relations needed to answer the query, using the 
information present at the external sites. For the purpose of 
our discussion, the knowledge base can also be viewed as an 
information source containing part of the extensions. 

It should be realized that the problem of finding relevant
sites is a crucial problem for system 101. Economical
solutions to the problem are important not only for answering
queries, but also for other operations. Examples include:

Processing updates on the knowledge base requires updating
relevant site relations and hence, determining the
relevant sites.

Efficiently monitoring queries over time requires determining
precisely which external site relations should be
monitored.

Maintaining consistency among site relations again
requires that we determine which sites contain information
relevant to a given consistency condition.
Finding the relevant sites is done by extending the algo- 
rithm described in Alon Y. Levy and Yehoshua Sagiv. 
"Constraints and Redundancy in Datalog* 1 . Proceedings of 
the Eleventh ACM SIGACT-SIGMOD-SIGART Sympo- 
sium on Principles of Database Systems. San Diego. Calif.. 
1992. The key observation that enables us to use that 
algorithm is that the language for expressing constraints 
(concept descriptions and order constraints) satisfies the 
requirements of the query-tree algorithm outlined in mat 
paper. Finding minimal portions of the sites is done in two 
steps. The first step determines which portions of the knowl- 
edge base relations are needed to solve the query, and the 
second step determines which portions of the site relations 
are needed to compute the relevant portions of the knowl- 
edge base relations. The algorithm uses the query-tree, 
which is a tool that, given a query which is expressed in 
terms of certain relations will specify which portions of the 
mentioned relations are relevant to the query. Trie first step 
is done by building a query-tree for the user query, in terms 
of the knowledge base relations, and pushing the constraints 
from the query to the KB relations. The second step is done 
by building a query-tree for each relevant KB relation 
(which is defined in terms of the external sites), and pushing 
the constraints to the external site relations. 
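
The effect of the two steps can be illustrated with a much simplified sketch. It is not the query-tree algorithm of the cited paper: it assumes that every constraint is a conjunction of closed numeric ranges on named attributes, and the site names miami_dir and budget_quotes, together with all constants, are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    """Closed numeric range standing in for an order constraint on one attribute."""
    low: float = float("-inf")
    high: float = float("inf")

    def overlaps(self, other: "Interval") -> bool:
        return self.low <= other.high and other.low <= self.high


def satisfiable(c1: dict, c2: dict) -> bool:
    """Two conjunctions of per-attribute ranges are jointly satisfiable when
    every attribute constrained by both has overlapping ranges."""
    return all(c1[a].overlaps(c2[a]) for a in c1.keys() & c2.keys())


# Hypothetical site descriptions: site relation -> (world-view relation, constraint).
SITES = {
    "bigapple_dir": ("dir", {"Ac": Interval(212, 212)}),
    "miami_dir": ("dir", {"Ac": Interval(305, 305)}),
    "budget_quotes": ("quote", {"C": Interval(0, 500)}),
}


def relevant_sites(query_relations: dict) -> list:
    """query_relations is the outcome of step 1: the constraint pushed from the
    query onto each world-view relation. Step 2 keeps only the sites whose
    description is consistent with that constraint."""
    return sorted(
        site
        for site, (relation, site_constraint) in SITES.items()
        if relation in query_relations
        and satisfiable(query_relations[relation], site_constraint)
    )
```

For a query restricted to area code 212 and prices under 1000, only bigapple_dir and budget_quotes survive; miami_dir is pruned because its area-code constraint contradicts the query.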

EXAMPLE 5.1 
There are currently many systems providing access to large
collections of databases. Consider such a system, which provides
access to two kinds of databases: (1) the flight information and
price quote databases of various airlines and travel agents in the
U.S., and (2) the telephone directory databases of various telephone
companies in the U.S., to obtain the phone numbers of the various
travel agents.

These different databases often contain the same information
redundantly. For example, the United Airlines database contains
information about United flights and price quotes, while the database
of some travel agent may have flight and price quote information
about domestic flights in the U.S. Similarly, the telephone directory
information may exist in databases distributed by area code, or in
databases distributed by types of customers (e.g., travel agents).

A user accessing this collection of databases may be interested in
obtaining a variety of information, e.g., the cheapest flight offered
by any airline or travel agent, the phone number of travel agents who
offer the cheapest deals, etc. A key problem facing the user of such
a current day system is that to find information of interest, the
user needs to search the various databases one by one, which is
extremely time-consuming and expensive. This problem is exacerbated
by the fact that the price quote databases, for example, provided by
different travel agents may use different schemas and different
conventions for representing their information.

World-View 115

World-view 115 in the preferred embodiment consists of 
the following types of entities: 

General n-ary relations: The attribute values of these relations are
drawn from a rich set of types, which includes primitive types such
as integers and strings, as well as more complex types defined by
CLASSIC concepts (described below). We refer to these relations by ℰ.

Concepts and objects: The data model of the world-view includes
CLASSIC concepts and objects. In CLASSIC, concepts (which correspond
to classes in object-oriented databases) are defined in terms of
descriptions that specify the properties that objects must satisfy in
order to belong to the concept. A collection of CLASSIC concepts can
be viewed as a rich type hierarchy.

A concept can itself be viewed as a unary relation; the extension of
this relation is the collection of all objects that satisfy the
concept description. We denote the concepts in world-view 115 by D.
The set of relations W = D ∪ ℰ are collectively referred to as the
world-view relations, and are typeset in this font.

Constraints: An important part of the data model of the 
world-view is the ability to express rich semantic informa- 
tion about the world-view relations using constraints, such 
as order constraints (e.g., Ac=212, Cost<1000). Note that
concepts can also be used to express semantic constraints. 

Having general n-ary relations in the world-view is essential for
modeling sites that contain arbitrary relational databases. (This
feature is not present in the world-view of the SIMS system, for
example.) For details on SIMS, see Y. Arens, C. Y. Chee, C.-N. Hsu,
and C. A. Knoblock, "Retrieving and integrating data from multiple
information sources", International Journal on Intelligent and
Cooperative Information Systems, 1994. However, a well-known problem
with the relational data model is that it does not provide a rich
type structure for values that occur in argument positions of
relations. Allowing for values to be drawn from a rich set of types
would considerably increase the modeling capabilities of the
relational data model. This is achieved in our world-view by
augmenting the relational model with CLASSIC's object-oriented model.
Note that our world-view does not explicitly include object
attributes. The reason is that an attribute A of a concept C can be
viewed as a binary relation, where the first argument of the relation
is of type C and the second argument of the relation has the type of
attribute A as its type. This is just a special case of general n-ary
relations, which are included in our world-view.

Constraints play a central role in the world-view for expressing
semantic information. We show how this semantic information is used
for efficiently answering queries further on. In principle, our
world-view allows constraints to be expressed using any domain where
implication (i.e., subsumption) reasoning can be done efficiently.
For order constraints, implication reasoning can be done in
polynomial time (see Ullman, supra). Subsumption reasoning in CLASSIC
can also be done in polynomial time (see A. Borgida and P. F.
Patel-Schneider, "A semantics and complete algorithm for subsumption
in the CLASSIC description logic", Journal of Artificial Intelligence
Research, 1:277-308, June 1994).

EXAMPLE 5.2 

Consider the airline flight application of Example 5.1. World-view
115 in this case is naturally conceptualized by the following
relations:

quote(Ag, Al, Src, Dst, C, D) denotes that a travel agent Ag quotes a
price of C to travel from Src to Dst on airline Al on date D.

dir(Cust, Ac, TelNo) gives the directory listing of customer Cust as
area code Ac and phone number TelNo.

areaCode(Pl, Ac) gives the area code(s) associated with place Pl.

The world-view also has a rich type hierarchy of CLASSIC concepts
describing, e.g., various types of telephone customers. The concept
customer is a primitive type that includes all telephone customers
and specifically the disjoint subconcepts business and residential.

Constraints are used to specify types of the attributes of the
world-view relations. For example, the attribute Cust of relation dir
is constrained to be of type customer, the attribute Ag of relation
quote is constrained to be of type travelAgent (a subconcept of
business) and the attribute C of quote is constrained to have
nonnegative values. □

Using CLASSIC in the World-View

CLASSIC is a member of a family of description logic systems. There
are several advantages to using a description logic system as part of
the domain model component of a global information system. The key
advantage is their ability to support extensibility and modifiability
of domain model 111. Although the world-view portion of domain model
111 should be relatively stable, the dynamic nature of the
information sources will unavoidably lead to changes in the
information descriptions 113 and system/network view 117 portions of
domain model 111 (e.g., new specialized services often get created,
transient discussion topics arise frequently, etc.). Even with world
view 115, users may want to make a personal version of world view 115
by defining new concepts and relations, creating new objects, and
asserting constraints about the world-view relations (e.g., a user
may want to define the set of universities with a researcher working
on global information systems).

A system such as CLASSIC supports extensibility by allowing new
concepts to be created and automatically placed in the concept
hierarchy. For example, suppose the concept hierarchy included the
concepts business and airline_agent (defined as a subconcept of
business that has fillers "travel" and "airline" for attribute
business_type). If the user wanted to add a new concept travel_agent
(defined as a subconcept of business that has a filler "travel" for
attribute business_type), CLASSIC would automatically place this new
concept in the concept hierarchy between business and airline_agent.
This would not be possible in object-oriented database systems that
require the class hierarchy to be explicitly created by the user.
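
The placement of travel_agent between business and airline_agent can be mimicked with a toy model in which a concept is nothing more than the set of fillers it requires for business_type. This is only an illustration of subsumption-based placement under that assumption; it is not CLASSIC's language or its classification algorithm, and the names Concept, subsumes and place are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Concept:
    name: str
    fillers: frozenset  # fillers required for the business_type attribute


def subsumes(general: Concept, specific: Concept) -> bool:
    """general subsumes specific when specific requires at least general's fillers."""
    return general.fillers <= specific.fillers


def place(new: Concept, hierarchy: list) -> tuple:
    """Return the names of the most specific subsumers and the most general
    subsumees of `new` among the concepts in `hierarchy`."""
    parents = [c for c in hierarchy if subsumes(c, new) and c.fillers != new.fillers]
    children = [c for c in hierarchy if subsumes(new, c) and c.fillers != new.fillers]
    # Keep a parent only if it does not subsume another, more specific parent.
    direct_parents = [p for p in parents
                      if not any(p != q and subsumes(p, q) for q in parents)]
    # Keep a child only if no other, more general child subsumes it.
    direct_children = [c for c in children
                       if not any(c != q and subsumes(q, c) for q in children)]
    return ([p.name for p in direct_parents], [c.name for c in direct_children])


business = Concept("business", frozenset())
airline_agent = Concept("airline_agent", frozenset({"travel", "airline"}))
travel_agent = Concept("travel_agent", frozenset({"travel"}))
```

Calling place(travel_agent, [business, airline_agent]) reports business as the parent and airline_agent as the child, without the user stating either link.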

A second advantage is that description logic systems do not require
the user to explicitly specify all concepts to which an object
belongs. Instead, such systems automatically classify objects in the
appropriate concepts, based on the definitions of the concepts and
the information available about the object. For example, suppose the
concept hierarchy included the concepts www_site and ftp_site (which
is defined to be the subconcept of www_site whose URL attribute
begins with the string ftp:). If the user creates an object as an
instance of www_site with its URL as ftp://research.att.com, then the
system will also classify it as an instance of ftp_site; this
classification is needed to use the appropriate protocol when
accessing the site. Current day object-oriented database systems do
not allow such automatic classification of objects.
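
A minimal sketch of this kind of automatic classification follows, assuming each concept is represented by a membership test over an object's attributes. The dictionary-based object model and the function name classify are illustrative assumptions, not part of CLASSIC.

```python
# Each concept is modeled as a predicate over an object's attribute dictionary.
CONCEPTS = {
    "www_site": lambda obj: "url" in obj,
    # ftp_site: the subconcept of www_site whose URL begins with "ftp:".
    "ftp_site": lambda obj: "url" in obj and obj["url"].startswith("ftp:"),
}


def classify(obj: dict) -> list:
    """Return every concept whose definition the object satisfies."""
    return sorted(name for name, test in CONCEPTS.items() if test(obj))
```

An object asserted only as a www_site with URL ftp://research.att.com is reported under both www_site and ftp_site.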

Description logic systems provide varying degrees of expressivity in
their concept definition language. Consequently, they vary
considerably in the complexity of subsumption reasoning (i.e., does
concept C1 subsume concept C2). CLASSIC stands out in this family as
a language which has been carefully designed so that subsumption
reasoning is in polynomial time, while still being expressive, and
has been used in large-scale commercial applications.

Finally, the most significant limitation of description logic 
systems is that their scale-up suffers in the presence of large 
collections of objects. However, this limitation does not 
impact on the use of CLASSIC in our world-view, since the 
world-view relations are not explicitly stored; information is 
explicitly stored only in the external information sources. 

The Query Language 

Many languages have been proposed for querying object/relational
databases. Our world-view is also object/relational in nature,
synthesizing the relational model with an object-oriented model.
Hence, any query language proposed for object/relational databases
can be used to query our world-view.

In this paper, for simplicity of exposition, we consider only
conjunctive queries of the form:

$Q(\bar{X}) \text{ :- } C(\bar{Y}), E_1(\bar{X}_1), \ldots, E_k(\bar{X}_k).$

The $E_i$'s are relation names from the world-view relations W, C is
a constraint on the variables of the query, and the elements of
$\bar{X}, \bar{Y}, \bar{X}_1, \ldots, \bar{X}_k$ are constants,
variables, or world-view objects. Constraints in queries are
conjunctions of order constraints.

EXAMPLE 5.3 

The following query retrieves the names and phone numbers of travel
agents in Miami who sell tickets from Newark to Santiago on any
airline for under $1000:

query(Name, Ac, TelNo) :- quote(Ag, Al, 'Newark, NJ', 'Santiago,
Chile', C, D), areaCode('Miami, FL', Ac), dir(Ag, Ac, TelNo),
name(Ag, Name), C<1000.

This query does not explicitly make use of the world-view concept
travelAgent since the type of Ag in the world-view relation quote is
constrained to be the concept travelAgent.
□
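
To make the semantics of such a conjunctive query concrete, the sketch below evaluates the query of Example 5.3 over small in-memory relations. All tuples (agent identifiers, names, prices, dates, phone numbers) are invented for illustration, and the nested-loop evaluation is only one naive way to compute the join.

```python
# Toy extensions of the world-view relations (all values are made up).
quote = {
    ("ag1", "United", "Newark, NJ", "Santiago, Chile", 950, "1995-06-01"),
    ("ag2", "United", "Newark, NJ", "Santiago, Chile", 1200, "1995-06-01"),
}
area_code = {("Miami, FL", 305)}
dir_ = {("ag1", 305, "555-0101"), ("ag2", 305, "555-0102")}
name = {("ag1", "Sunshine Travel"), ("ag2", "Palm Tours")}


def query() -> list:
    """query(Name, Ac, TelNo) :- quote(...), areaCode(...), dir(...), name(...), C < 1000."""
    return sorted({
        (n, ac, tel)
        for (ag, al, src, dst, c, d) in quote
        if src == "Newark, NJ" and dst == "Santiago, Chile" and c < 1000
        for (pl, ac) in area_code
        if pl == "Miami, FL"
        for (ag2, ac2, tel) in dir_
        if ag2 == ag and ac2 == ac      # join on agent and area code
        for (ag3, n) in name
        if ag3 == ag                    # join on agent
    })
```

Only the agent quoting under $1000 appears in the answer; the shared variables Ag and Ac become equality tests between tuples.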

Typically, languages for querying object/relational databases use
SQL-like constructs to access attributes of relations, and "path
expressions" to access attributes of objects. In our world-view,
concepts can be viewed as unary relations, and object attributes can
be viewed as binary relations. Consequently, accessing object
attributes using path expressions is equivalent to using a chain of
unary and binary relations corresponding to concepts and attributes.
For this reason, our queries are conjunctive relational queries
expressed in terms of the world-view relations and objects.

Sites and Site Descriptions: FIG. 2

Users pose queries in terms of the relations W of world view 115.
However, the world-view relations constitute just a conceptual view;
the information required to answer queries is present in the external
information sources 123 described in information source descriptions
113. Information sources 123 can be viewed as providing extensions of
site relations R from information source descriptions 113, which are
typeset in this font. In order to answer user queries, the system
needs a precise description of the site relations R. Such a
description is termed herein a site description. As shown in FIG. 2,
a site description 201 in a preferred embodiment includes at least
two types of information:

a content specification 203 which relates the contents of the
external relations R with the world-view relations W.

a set of query forms 205 (0..n) which indicates subsets of queries on
the relations R that the external site is willing to answer.

In a preferred embodiment there are two subsets of queries indicated
by the query forms: those queries which the external site can answer
at all and those queries which the external site can answer
efficiently. We first present some examples of site descriptions 201
to illustrate specification of content and capability. We then
formally describe the language used for content specifications 203.

EXAMPLE 5.4

A travel information source provides directory information for travel
agents in the relation travel_dir(Ag, Ac, TelNo). Content
specification 203 for this relation specifies that this relation
contains telephone information about travel agents in the dir
world-view relation, though not necessarily all travel agents.

The query forms 205 for this travel information source specify that
this source answers two kinds of queries: first, the information
source provides an agent's area code and phone number, given a
specific travel agent, and second, the information source provides
all travel agents and their phone numbers, given an area code. This
information source does not answer queries where none of the
arguments is bound to a constant.

The Manhattan directory information source provides the relation
bigapple_dir(Cust, TelNo). The content specification 203 for this
relation specifies that this relation contains the phone numbers of
customers in the 212 area code. In addition, content specification
203 specifies that it has complete information about the phone
numbers of customers in the 212 area code, i.e., there is no phone
number in the 212 area code which does not exist in the relation
bigapple_dir. Specifying completeness information is useful for a
query processor to determine that it need not query any other sources
for information regarding 212 phone numbers. See O. Etzioni, K.
Golden, and D. Weld, "Tractable closed world reasoning with updates",
In Proceedings of KR-94, 1994. □
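
The capabilities described for the travel source in Example 5.4 can be sketched by marking each argument position as 'b' (must be bound to a constant) or 'f' (may be free or bound), the same encoding that the query-form discussion later in this document uses. The function names and the representation of a query as a tuple of booleans are illustrative assumptions.

```python
def conforms(form: str, bound: tuple) -> bool:
    """form is a string over {'b', 'f'}; bound[i] is True when the i'th
    argument of the query is a constant. Every 'b' position must be bound;
    an 'f' position accepts either a variable or a constant."""
    if len(form) != len(bound):
        raise ValueError("query arity does not match the query form")
    return all(is_bound for f, is_bound in zip(form, bound) if f == "b")


# travel_dir(Ag, Ac, TelNo): answerable given an agent, or given an area code.
TRAVEL_DIR_FORMS = ["bff", "fbf"]


def willing_to_answer(bound: tuple) -> bool:
    """The source answers a query iff its bindings match one of its forms."""
    return any(conforms(form, bound) for form in TRAVEL_DIR_FORMS)
```

A query with no constants, such as travel_dir(Ag, Ac, TelNo) with all three arguments free, matches neither form and is rejected.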

Details of Content Specifications 203 

A content specification 203 describes the contents of external site
relations R by relating them to the world-view relations W. A content
specification 203 thus has three parts: a right hand 211 which is a
conjunction of expressions involving relations in world view 115, a
left hand 207 of expressions involving relations in information
source descriptions 113, and a connector 209 between them. In the
site description language of the preferred embodiment, a content
specification may have one of the following four forms:

$C_R(\bar{Y}), R_1(\bar{X}_1), \ldots, R_k(\bar{X}_k) \subseteq C_E(\bar{X}), E(\bar{X})$ (1)

$C_R(\bar{Y}), R_1(\bar{X}_1), \ldots, R_k(\bar{X}_k) \equiv C_E(\bar{X}), E(\bar{X})$ (2)

$C_R(\bar{X}), R(\bar{X}) \subseteq C_E(\bar{Y}), E_1(\bar{X}_1), \ldots, E_k(\bar{X}_k)$ (3)

$C_R(\bar{X}), R(\bar{X}) \equiv C_E(\bar{Y}), E_1(\bar{X}_1), \ldots, E_k(\bar{X}_k)$ (4)

The R's (with or without subscripts) refer to the external site
relations, the E's (with or without subscripts) refer to the
world-view relations, and the $C_R$'s and $C_E$'s denote constraints
(order constraints and CLASSIC concepts). $\bar{X}$ (with or without
subscripts) and $\bar{Y}$ denote tuples of variables and/or
constants. Each expression must be range-restricted, i.e.,
$\bar{X} \subseteq \bar{X}_1 \cup \ldots \cup \bar{X}_k$.

The meaning of an expression is the natural one, given by the
following relational algebra expressions (where σ denotes selection,
π denotes projection, and ⋈ denotes join). For example, the meaning
of content specifications of form (1) is:

The meaning of content specifications of form (4) is: 
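
Under the natural reading just described, the two meanings can be written as follows. This is a reconstruction from the definitions of forms (1) and (4), with C_R and C_E standing for the constraints on the site side and the world-view side; it should be taken as a plausible rendering rather than the original typeset expressions.

```latex
% Form (1): the selected, projected join of the site relations is contained
% in the selected fragment of the world-view relation E.
\pi_{\bar{X}}\bigl(\sigma_{C_R}(R_1 \bowtie \cdots \bowtie R_k)\bigr)
  \subseteq \sigma_{C_E}(E)

% Form (4): the selected fragment of the site relation R is exactly the
% selected, projected join of the world-view relations.
\sigma_{C_R}(R)
  = \pi_{\bar{X}}\bigl(\sigma_{C_E}(E_1 \bowtie \cdots \bowtie E_k)\bigr)
```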

Expressions of the type (1) and (2) differ from expressions of the
type (3) and (4) in the following way. The first two specify how
fragments of world-view relations can be computed from the site
relations, i.e., the world-view relation fragments are akin to
traditional views on the site relations and external database schemas
in multidatabases. See W. Litwin, L. Mark, and N. Roussopoulos,
"Interoperability of multiple autonomous databases", ACM Computing
Surveys, 22(3):267-293, September 1990. In contrast, the latter two
define the contents of fragments of the site relations as views on
the world-view relations.

An expression of type (1) specifies that part of the 
fragment is computed using the description. An expression 
of type (2) specifies that all of the fragment is computed 
using the description. The relationship between expressions 
of type (3) and (4) is the same as the relationship between 
expressions of type (1) and (2). 

EXAMPLE 5.5 

Consider our airline flight application. Fly-by-Night Airlines
provides two site relations 207: fbn_flights(Flt, Src, Dest), which
denotes that flight Flt of Fly-by-Night Airlines is from Src to Dest,
and fbn_quote(Ag, Flt, C, D), which denotes that a designated travel
agent Ag of Fly-by-Night Airlines quotes a price of C to travel by
flight Flt on date D. The world-view relation 211 quote can be
related to the contents of the site relations fbn_flights and
fbn_quote using a content specification 203 of the form (1) as
follows:

fbn_flights(Flt, Src, Dest), fbn_quote(Ag, Flt, C, D) ⊆ quote(Ag,
'Fly-by-Night', Src, Dest, C, D).

This content specification 203 states that tuples in the relation
quote can be computed by joining tuples in the relations fbn_flights
and fbn_quote.

Suppose that only the designated travel agents of Fly-by-Night
Airlines were allowed to offer quotes on Fly-by-Night Airlines. Then,
all the information about fare quotes for this airline is present in
the relations fbn_flights and fbn_quote. This complete information
can be represented using a content specification 203 of the form (2)
as follows:

fbn_flights(Flt, Src, Dest), fbn_quote(Ag, Flt, C, D) ≡ quote(Ag,
'Fly-by-Night', Src, Dest, C, D).
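
The computation that these content specifications license can be sketched as a join of the two site relations on the flight number, assuming the reading fbn_flights(Flt, Src, Dest) and fbn_quote(Ag, Flt, C, D) for the site relations. The sample tuples and the function name quote_fragment are invented for illustration.

```python
# Toy extensions of the two site relations (all values are made up).
fbn_flights = {
    ("FN101", "Newark, NJ", "Miami, FL"),
    ("FN202", "Newark, NJ", "Boston, MA"),   # no quote exists for this flight
}
fbn_quote = {("ag1", "FN101", 180, "1995-06-01")}


def quote_fragment(flights: set, quotes: set) -> set:
    """Tuples of the world-view relation quote(Ag, Al, Src, Dst, C, D) obtained
    by joining the site relations on the flight number Flt."""
    return {
        (ag, "Fly-by-Night", src, dest, c, d)
        for (flt, src, dest) in flights
        for (ag, quoted_flt, c, d) in quotes
        if flt == quoted_flt
    }
```

Under form (1) the result is some of the Fly-by-Night tuples of quote; under form (2) it is all of them, so no other source needs to be consulted for this airline.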

EXAMPLE 5.6 

Consider the external site relations described in Example 5.4. The
external site relation travel_dir contains a listing of travel
agents, though not necessarily all of them. This is specified using a
content specification of the form (3) as follows:

travel_dir(Ag, Name, Ac, TelNo) ⊆ dir(Ag, Ac, TelNo),
travelAgent(Ag), name(Ag, Name).

This content specification 203 states that the site relation
travel_dir already has a subset of the join of the world-view
relations dir and travelAgent. □

Our site description language does not allow content specifications
203 of the form:

$C_R(\bar{Y}), R_1(\bar{X}_1), \ldots, R_k(\bar{X}_k) \supseteq C_E(\bar{X}), E(\bar{X})$

$C_R(\bar{X}), R(\bar{X}) \supseteq C_E(\bar{Y}), E_1(\bar{X}_1), \ldots, E_k(\bar{X}_k)$

Intuitively, these content specifications are not useful because they
only provide information about tuples that are "possibly" in the
world-view relations, and not about tuples that are "definitely" in
the world-view relations. The following example illustrates this
point.

EXAMPLE 5.7 

The external site relation ta_ia_dir contains a listing of the phone
numbers of all travel agents as well as all insurance agents. The
contents of this site relation can be specified using the content
specifications:

ta_ia_dir(Ag, Ac, TelNo) ⊇ dir(Ag, Ac, TelNo), travelAgent(Ag).

ta_ia_dir(Ag, Ac, TelNo) ⊇ dir(Ag, Ac, TelNo), insuranceAgent(Ag).

Without any means of distinguishing which number in this site
relation is the phone number of a travel agent, and which is the
phone number of an insurance agent, this site relation is not useful
in answering queries on the world-view relation travelAgent.
□

Specifying Query Forms 205

Information sources in global information systems are autonomous and,
for reasons such as security or privacy, may decide to answer only a
subset of the possible queries on the site relations. In our site
description language, each information source can specify the subset
of queries it is willing to answer using a set of query forms 205 on
the site relations provided by the information source. For details on
query forms, see J. D. Ullman, Principles of Database and
Knowledge-base Systems, Volumes I and II, Computer Science Press,
1989.

Intuitively, a query form 205 m^R on a k-ary relation R is a string
of length k, using the alphabet {b, f}. A 'b' in the i'th position
indicates that the i'th argument of R must be bound to a constant in
a query conforming to m^R; an 'f' in the i'th position indicates that
the i'th argument of R can either be free or be bound to a constant.
An information source is willing to answer a query on a site relation
if and only if the query bindings match one of its query forms.

EXAMPLE 5.8

Consider the external information sources of Example 5.4. The travel
information source specifies the subset of queries on relation
travel_dir that it is willing to answer as follows:

possible_queries: travel_dir{b f f, f b f}.

The query form 205 b f f indicates that given a specific travel
agent, the information source can provide the agent's area code and
phone number. The query form 205 f b f indicates that given an area
code, the information source can provide the travel agents and their
phone numbers in that area code. □

Often it is the case that some of the queries that an external
information source is willing to answer can be answered efficiently,
because of clustering of tuples in the site relations, availability
of indices, etc. Answering queries in a global information system can
be optimized if this information were available to the query
processor. Hence, our site description language also allows external
information sources to specify the subset 215 of queries that it can
answer efficiently, again using query forms 205.

EXAMPLE 5.9

Consider our airline flight application, and the travel information
source which provides the site relation travel_dir. This source is
willing to answer queries matching either of the query forms b f f
and f b f (see Example 5.8). These query forms thus make up the set
of permitted queries 213. Answering queries matching b f f can be
quite efficient because of the availability of a primary index on the
travel agent attribute, while answering queries matching f b f might
be quite inefficient because of the absence of any clustering in the
site relation travel_dir. The subset 215 of queries that can be
efficiently answered by the travel information source can be
specified as follows:

efficient_queries: travel_dir{b f f}.

Of course, the access plan would first attempt to use the efficient
queries provided by information source 213 to answer the query, and
would specify an inefficient query only if there were no other way to
obtain the information.

In other embodiments, site descriptions 201 may include other useful
information such as the cost and reliability of accessing tuples of
the site relations. Incorporation of these into the site description
language requires the development of algorithms that can use this
information effectively in query evaluation.

Query Evaluation

Users of a global information system 101 formulate queries in terms
of relations in world view 115, without regard to the location and
distribution of this information. However, the world-view relations
are not explicitly stored; all the data that are needed to answer
these queries reside in site relations in external information
sources 123. It is the task of the query evaluation system to access
these external site relations and answer the user's queries. Since
the cost of accessing an information source over the network is
significant, the main optimization to be performed is to minimize the
number of external information sources 123 that need to be accessed
in order to answer the query. In this section, we present several
techniques that make effective use of site descriptions to minimize
access to external information sources.

Answering Queries: FIG. 3

Query evaluation in a database system typically has two steps:
generating a plan for the query, and executing this plan. In
traditional database systems, a query plan specifies the order of
computing the joins of the database relations in the query and the
techniques used for each of the joins. This requires that each of the
database relations mentioned in the query be either stored
explicitly, or computed on demand. Since the world-view relations in
a global information system are not stored explicitly, the query plan
has to compute the tuples in the world-view relations from the tuples
in the site relations.

Our algorithm for generating a query plan is shown in FIG. 3.
Algorithm 301 operates after a join order for the query has been
determined using traditional techniques. Algorithm 301 creates
sub-plans for evaluating each of the conjuncts in the query. It does
so by determining which external information sources need to be
queried in order to obtain tuples of a world-view relation E(W) that
satisfies some constraint C(W) (which is statically computed from the
query). Our algorithm assumes that each external site has the
capability of answering any query form. The algorithm can be
straightforwardly extended, using the techniques described in K. A.
Morris, "An algorithm for ordering subgoals in NAIL!", In Proceedings
of the ACM Symposium on Principles of Database Systems, pp. 82-88,
March 1988, to handle cases when only certain query forms can be
answered, or when certain query forms can be handled more
efficiently.

Algorithm 301 generates a plan that is guaranteed to be sound, i.e.,
all the answers obtained by executing this plan are indeed answers to
the query. If all content specifications are of the form (1) or (2),
executing the plan is guaranteed to be complete also.

However, since algorithm 301 tries to answer each conjunct in the
query separately, it may not be complete in the presence of content
specifications of the forms (3) and (4), as illustrated by the
following example.

EXAMPLE 5.10

Consider a query that retrieves names and telephone numbers of travel
agents in the 212 (Manhattan, New York) area code:

query(Name, TelNo) :- travelAgent(Ag), dir(Ag, 212, TelNo),
name(Ag, Name).

Suppose that the site relation nyTA precisely has the names and
telephone numbers of all the travel agents in the 212 area code,
specified using the following content specification:

nyTA(Name, TelNo) ≡ travelAgent(Ag), dir(Ag, 212, TelNo),
name(Ag, Name).

The answer to the query can be computed by using just the tuples in
the external site relation nyTA. However, our algorithm would not be
able to determine that the site relation nyTA is useful, since it
would try to separately compute the tuples in the world-view
relations travelAgent, dir and name, and the nyTA site relation does
not have the variable Ag, which is present in each of the three
world-view relations. □

A complete strategy for answering queries in the presence of content
descriptions of the forms (3) and (4) requires solving the problem of
answering queries using materialized views. A general solution to
this problem which works for a large class of query languages is
described in the next section. The work on the general solution
resulted in a demonstration that answering queries using materialized
views (even when the query and the views are just conjunctive
queries) is NP-complete, whereas algorithm 301 presented here is in
polynomial time.

A key aspect of algorithm 301 is that it generates a plan that
accesses only information sources that can possibly contribute to
answering the query, given the static constraints in the query and in
the site descriptions. Furthermore, we can extend algorithm 301 to
cases in which both the query and the content specifications 203 of
the form (1) and (2) involve aggregation, negation and recursion,
using techniques described in A. Y. Levy and Y. Sagiv, "Constraints
and redundancy in Datalog", In Proceedings of the Eleventh ACM
Symposium on Principles of Database Systems, San Diego, Calif., June
1992; A. Y. Levy, I. S. Mumick, Y. Sagiv, and O. Shmueli,
"Equivalence, query-reachability and satisfiability in Datalog
extensions", In Proceedings of the ACM Symposium on Principles of
Database Systems, Washington, D.C., 1993; and A. Y. Levy, I. S.
Mumick, and Y. Sagiv, "Query optimization by predicate move-around",
In Proceedings of the International Conference on Very Large
Databases, Santiago, Chile, September 1994.

Answering Queries using Materialized Views 
Answering a query using materialized views can be done 
in two steps. In the first step, containment mappings from the 
bodies of the views to the body of the query are considered 
to obtain rewritings of the query. The appropriate view
literals for the rewriting are added to the query. In the second 
step, redundant literals of the original query are removed. 
Once this is done, evaluation of the query is done using one 
of these new versions which is cheaper to evaluate than the 
original query. The following discussion begins with some 
preliminary definitions and a running example and then 
presents detailed descriptions of the two steps. 

Preliminaries

In our discussion we refer to the relations used in the query as the
database relations. We consider conjunctive and unions of conjunctive
queries (i.e., datalog without recursion). In addition, queries may
contain built-in comparison predicates (=, < and ≤). We use V, V_1,
..., V_m to denote views that are defined on the database relations.
Views are also defined using queries. Given a query Q, our goal is to
find an equivalent rewriting Q' of the query that uses one or more of
the views:

Definition 5.1: A query Q' is a rewriting of Q that uses the views
V = V_1, ..., V_m if

Q and Q' are equivalent (i.e., produce the same answer for any given
database), and

Q' contains one or more occurrences of literals of V.

We consider only rewritings that have the same form as the original
query (i.e., they do not use a more expressive query language than
the original query).

We say that a rewriting Q' is locally minimal if we cannot remove any
literals from Q' and still retain equivalence to Q. A rewriting is
globally minimal if there is no other rewriting with fewer literals.
(Note that we do not count literals of built-in predicates.)

EXAMPLE 5.11

Consider the following query and view:

q(X,U) :- p(X,Y), p0(Y,Z), p1(X,W), p2(W,U).
v(A,B) :- p(A,C), p0(C,B), p1(A,D).

The query can be rewritten using v as follows:

q(X,U) :- v(X,Z), p1(X,W), p2(W,U).

Substituting the view enabled us to remove the first two
literals of the query. Note, however, that although the third
literal in the query is guaranteed to be satisfied by the view,
we could not remove it from the query because the variable
W also appears in the last literal. □
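The equivalence claimed in Example 5.11 can be checked mechanically by searching for containment mappings in both directions. The following is a minimal sketch, not the method of the cited paper: the tuple encoding of rules, the uppercase-variable convention, and the names `mappings`, `contained_in` and `unfold` are assumptions made for illustration, and built-in comparison predicates are not handled.

```python
def is_var(term):
    """Variables start with an uppercase letter (an assumed convention)."""
    return term[:1].isupper()

def mappings(src_body, dst_body, partial):
    """Yield every extension of `partial` that maps each literal of
    src_body onto some literal of dst_body (a containment mapping)."""
    if not src_body:
        yield dict(partial)
        return
    (pred, args), rest = src_body[0], src_body[1:]
    for dst_pred, dst_args in dst_body:
        if dst_pred != pred or len(dst_args) != len(args):
            continue
        trial = dict(partial)
        if all(trial.setdefault(a, d) == d if is_var(a) else a == d
               for a, d in zip(args, dst_args)):
            yield from mappings(rest, dst_body, trial)

def contained_in(q1, q2):
    """q1 is contained in q2 iff a containment mapping q2 -> q1 exists."""
    (head1, body1), (head2, body2) = q1, q2
    partial = {}
    for a, d in zip(head2, head1):  # head variables must map to each other
        if partial.setdefault(a, d) != d:
            return False
    return next(mappings(body2, body1, partial), None) is not None

def unfold(view, literal_args, tag):
    """Replace a view literal by the view body, renaming the view's
    non-head variables to fresh ones (suffix `tag`)."""
    head, body = view
    sub = dict(zip(head, literal_args))
    def rename(t):
        return sub.setdefault(t, f"{t}_{tag}") if is_var(t) else t
    return [(p, tuple(rename(t) for t in args)) for p, args in body]

# Query and view of Example 5.11.
q = (("X", "U"), [("p", ("X", "Y")), ("p0", ("Y", "Z")),
                  ("p1", ("X", "W")), ("p2", ("W", "U"))])
v = (("A", "B"), [("p", ("A", "C")), ("p0", ("C", "B")), ("p1", ("A", "D"))])

# The rewriting q(X,U) :- v(X,Z), p1(X,W), p2(W,U), with v unfolded.
rewriting = (("X", "U"), unfold(v, ("X", "Z"), "v")
             + [("p1", ("X", "W")), ("p2", ("W", "U"))])
# The same rewriting with p1(X,W) dropped as well.
too_short = (("X", "U"), unfold(v, ("X", "Z"), "v") + [("p2", ("W", "U"))])

equivalent = contained_in(q, rewriting) and contained_in(rewriting, q)
still_equivalent = contained_in(q, too_short) and contained_in(too_short, q)
```

Under these assumptions `equivalent` holds while `still_equivalent` does not, which mirrors the remark that the third literal cannot be removed because W also appears in the last literal.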

Clearly, we would like to find rewritings that are cheaper 
to evaluate than the original query. The cost of evaluation 
will depend on many factors which differ from application 
to application. In this paper we consider rewritings which 
reduce the number of literals in the query, and in particular, 
reduce the number of database relation literals in the query. 
In fact we will show that any rewriting of Q that contains 
a minimal number of literals is isomorphic to a query that 
contains a subset of the literals of Q and a set of view literals. 
Although we focus on reducing the number of literals, it 
should be noted that rewritings can yield optimizations even 
if we do not remove literals from the query, as illustrated by 
the following example. 

EXAMPLE 5.12 

Using the same query as in Example 5.11, suppose we
have the following view:

v1(A) :- p(A,C), p1(A,D).

We can add the view literal to the query to obtain the
following rewritten query.

q(X,U) :- v1(X), p(X,Y), p0(Y,Z), p1(X,W), p2(W,U).

The view literal acts as a filter on the values of X that are
considered in the query. It restricts the set of values of X to
those that appear both in the relation p and p1. □
In some applications we may not have access to any of the
database relations. Therefore, it is important to consider the
problem of whether the query can be rewritten using only the
views. We call such rewritings complete rewritings:

Definition 5.2: A rewriting Q' of Q, using V = V_1, ..., V_m,
is a complete rewriting if Q' contains only literals of V and
built-in predicates. □

EXAMPLE 5.13

Suppose that in addition to the query and the view of
Example 5.11 we also have the following view:

v2(A,B) :- p1(A,C), p2(C,B), p0(D,E).

The following is a complete rewriting of q that uses v and
v2:

q(X,U) :- v(X,Z), v2(X,U).

It is important to note that this rewriting cannot be
achieved in a stepwise fashion by first rewriting q using v
and then trying to incorporate v2 (or the other way around).
Finding the complete rewriting requires that we consider the
usages of both views in parallel. □

Finding Redundant Literals in the Rewritten Query

In this section we describe a polynomial algorithm for the
second step. Given mappings from the views to the query,
the algorithm determines a set of literals from the query that
can be removed. We show that under certain conditions there
is a unique maximal set of such literals and the algorithm is
guaranteed to find them. In other cases, the algorithm may
find only some of the redundant literals, but the literals
it removes are guaranteed to be redundant, and therefore the
algorithm is always applicable. Note that in such cases, the
rest of the query can still be minimized using known
techniques. Together with an algorithm for enumerating
mappings from the views to the query, our algorithm
provides a practical method for finding rewritings. For
simplicity, we describe the algorithm for the case of
rewriting using a single occurrence of a view.

Suppose our query is of the form

q(X) :- p_1(U_1), ..., p_n(U_n). (5)

and we have the following view:

v(Z) :- r_1(V_1), ..., r_m(V_m). (6)

Let h be a containment mapping from the body of v into the
body of q, and let the following be the result of adding the
view literal to the query:

q(X) :- p_1(U_1), ..., p_n(U_n), v(Y), (7)

where Y = h(Z). Note that we can restrict ourselves to
mappings where the variables of Y already appear in the
p_i(U_i). To obtain a minimal rewriting, we want to remove as
many of the p_i literals as possible.

To determine the set of redundant literals, consider the
rule resulting from substituting the definition of Rule (6)
instead of the view literal in Rule (7). That is, we rename the
variables of Rule (6) as follows. Each variable T that appears
in Z is renamed to h(T), and each variable of Rule (6) that
does not appear in Z is renamed to a new variable (that is not
already among the p_i(U_i)). Let the following be the result of
this substitution.

q(X) :- p_1(U_1), ..., p_n(U_n), r_1(V_1'), ..., r_m(V_m'). (8)

Note that the variables of Y are the only ones that may
appear in both the p_i(U_i) and the r_j(V_j').

Given the mapping h there is a natural containment
mapping from Rule (8) into the original rule for q (i.e., Rule
(5)) that is defined as follows. Each subgoal p_i(U_i) is mapped
to itself and each subgoal r_j(V_j') is mapped to the same
subgoal of Rule (5) as in the containment mapping h (from
Rule (6) to Rule (5)). We will denote this containment
mapping as φ. The following is an important observation
about φ: The containment mapping φ maps each variable of
Y to itself.

Each subgoal p_i(U_i) of Rule (5) is the image (under φ) of
itself, and maybe a few of the r_j(V_j') literals. We say that the
literals r_j(V_j') that map to p_i(U_i) under φ are the associates of
p_i(U_i). For the rest of the discussion, we choose arbitrarily
one of the associates of p_i(U_i) and refer to it as the associate
of p_i(U_i). Note that if h maps no two subgoals of the view to
the same subgoal in Rule (5), then each p_i(U_i) will have at
most one associate.

Before we define the set of redundant subgoals, we need
the following definition:

Definition 5.3: A subgoal r_j(V_j') covers a subgoal p_i(U_i) if
all of the following hold.

The subgoals r_j(V_j') and p_i(U_i) have the same predicate.

If p_i(U_i) has a distinguished variable (or a constant) in
some argument position a, then r_j(V_j') also has that
variable (or constant) in argument position a.

If argument positions a_1 and a_2 of p_i(U_i) are equal, then
so are argument positions a_1 and a_2 of r_j(V_j').

The set of redundant literals will be the complement
of the set of needed literals N, defined as follows:

Definition 5.4: The set N is the minimal set satisfying the
following four conditions.

1. All the p_i(U_i) that do not have associates are in N.

2. If r_j(V_j') is the associate of p_i(U_i) and r_j(V_j') does not
cover p_i(U_i), then p_i(U_i) is in N.

3. Suppose that all of the following hold.
Subgoal p_i(U_i) has the variable T in argument position
a_1.
The associate of p_i(U_i) has the variable H in argument
position a_1.
The variable H is not in Y (hence, H appears only
among the r_j(V_j')).
The variable T also appears in argument position a_2 of
p_l(U_l).
The associate of p_l(U_l) does not have H in argument
position a_2.
Then p_l(U_l) is in N.

4. Suppose that p_l(U_l) is in N and that variable T appears
in p_l(U_l). If p_i(U_i) has variable T in argument position
a and its associate does not have T in argument position
a, then p_i(U_i) is also in N.

EXAMPLE 5.14

Consider the query and the view of Example 5.11. The
result of substituting the view in the query would be the
following:

q(X,U) :- p(X,Y), p0(Y,Z), p1(X,W), p2(W,U), p(X,C),
p0(C,Z), p1(X,D).

The literal p2(W,U) is needed because it does not have an
associate. The literal p1(X,W) is needed by condition 4 in the
definition, because its associate p1(X,D) does not contain the
variable W (which appears in p2(W,U)). Consequently, these
two literals need to be retained to obtain the minimal
rewriting. □
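Definitions 5.3 and 5.4 can be read as a small fixpoint computation. The sketch below is one possible reading, under stated assumptions: literals are encoded as (predicate, arguments) tuples, variables start with an uppercase letter, "distinguished" is taken to mean the head variables of the query, each subgoal's chosen associate is supplied by the caller and has the same arity as the subgoal, and the function names are hypothetical. It reproduces the outcome of Example 5.14.

```python
def is_var(term):
    return term[:1].isupper()

def covers(r, p, distinguished):
    """Definition 5.3: does subgoal r cover subgoal p?"""
    (r_pred, r_args), (p_pred, p_args) = r, p
    if r_pred != p_pred or len(r_args) != len(p_args):
        return False
    for a, t in enumerate(p_args):
        # constants and distinguished variables must recur in position a
        if (not is_var(t) or t in distinguished) and r_args[a] != t:
            return False
    n = len(p_args)
    # equal argument positions of p must be equal in r as well
    return all(r_args[i] == r_args[j]
               for i in range(n) for j in range(i + 1, n)
               if p_args[i] == p_args[j])

def needed_literals(body, assoc, distinguished, ybar):
    """Definition 5.4: indices of the query subgoals that must be kept.
    `assoc` maps a subgoal index to its chosen associate, if it has one."""
    needed = set()
    for i, p in enumerate(body):                      # conditions 1 and 2
        r = assoc.get(i)
        if r is None or not covers(r, p, distinguished):
            needed.add(i)
    for i, (_, p_args) in enumerate(body):            # condition 3
        r = assoc.get(i)
        if r is None:
            continue
        for a1, t in enumerate(p_args):
            h = r[1][a1]
            if not is_var(t) or not is_var(h) or h in ybar:
                continue
            for l, (_, l_args) in enumerate(body):
                r_l = assoc.get(l)
                for a2, u in enumerate(l_args):
                    if u == t and (r_l is None or r_l[1][a2] != h):
                        needed.add(l)
    changed = True
    while changed:                                    # condition 4
        changed = False
        shared = {t for l in needed for t in body[l][1] if is_var(t)}
        for i, (_, p_args) in enumerate(body):
            if i in needed:
                continue
            r = assoc[i]  # every subgoal outside `needed` has an associate
            if any(t in shared and r[1][a] != t for a, t in enumerate(p_args)):
                needed.add(i)
                changed = True
    return needed

# Example 5.14: subgoals of q and their associates from the unfolded view.
body = [("p", ("X", "Y")), ("p0", ("Y", "Z")),
        ("p1", ("X", "W")), ("p2", ("W", "U"))]
assoc = {0: ("p", ("X", "C")), 1: ("p0", ("C", "Z")), 2: ("p1", ("X", "D"))}
needed = needed_literals(body, assoc, distinguished={"X", "U"}, ybar={"X", "Z"})
redundant = set(range(len(body))) - needed
```

With this encoding `needed` contains the positions of p1(X,W) and p2(W,U), and the first two literals come out as redundant, in agreement with the example.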

Further details and the proofs of complexity may be found
in A. Y. Levy, A. O. Mendelzon, Y. Sagiv, and D. Srivastava,
"Answering queries using views", to appear in Proceedings
of the 14th Symposium on Principles of Database Systems,
San Jose, Calif., May 22-25, 1995.
Using Completeness Information
In generating a plan for answering a query, algorithm 301
accesses all (and only) sources that may contribute to
answering the query. While this may be necessary in general,
there are many cases where a small subset of the relevant site
relations contains all the information needed to answer the
query. Since completeness information of single sources can
be expressed in the content specification 203 (using
specifications of the forms (2) and (4)), the query processor
can effectively use these forms of content specification 203 to
ignore redundant sites.

EXAMPLE 5.15 

Consider the airline flight application. Let the site relation
ta_dir contain listings of all travel agents in the U.S. and let
the site relation bigapple_dir contain listings of all
telephone customers in the 212 area code.

Accessing both these site relations is redundant when
answering a query that asks for the phone number of a specific
travel agent in the 212 area code, although both these site
relations are relevant to answering this query. Querying
either of these two site relations suffices.

Both these site relations are also relevant to answering the
query that asks for the phone number of a specific travel
agent (without knowing the area code of the travel agent).
However, querying ta_dir is sufficient in this case, though
querying bigapple_dir may not be sufficient. □

Intuitively, we use content specifications of the form (2)
as follows. Given that we are trying to compute tuples of a
world-view relation E that satisfy the constraint C, we search
for a minimal set SD_1, ..., SD_n of content specifications
205 which together can be used to compute all the tuples of
E that satisfy C. Formally, the algorithm for doing this is the
following.

Suppose we are trying to compute the tuples of E(W) that
satisfy the constraint C(W). Our algorithm chooses a set
SD_E = {SD_1, ..., SD_n} of content specifications of the form
(2):

C_R^j(X_j), R_j(X_j) ⊇ C_E^j(W), E(W)

for 1 ≤ j ≤ n such that:
C(W) ⇒ C_E^1(W) ∨ ... ∨ C_E^n(W).
There is no proper subset of SD_E that satisfies the first
property.
If such a set does not exist for C(W), then let C'(W) be the
weakest constraint for which such a set does exist. (The
constraint C'(W) can be obtained by conjoining C(W) with
the disjunction of the C_E's of all content descriptions of the
form (2).) The tuples of E(W) that satisfy the constraint
C'(W) can be computed using content specifications 205 of the
form (2), as above. Furthermore, let C''(W) be C(W) \ C'(W).
The tuples of E(W) that satisfy the constraint C'' can be
computed using the other content specifications 205, as
described in Algorithm 301.
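The selection of a minimal set of complete sources can be illustrated on Example 5.15. The sketch below makes a strong simplifying assumption that is not in the text: each constraint is represented by the finite set of (kind, area code) combinations it admits, so that implication becomes set inclusion. The tiny domain, the function names and the brute-force search are all illustrative choices, not the patented algorithm.

```python
from itertools import combinations, product

# Assumed toy domain: every (kind of listing, area code) combination.
DOMAIN = set(product(["agent", "other"], ["212", "908"]))

def extension(pred):
    """The set of domain elements admitted by a constraint."""
    return frozenset(w for w in DOMAIN if pred(*w))

# Complete sources: each holds ALL listings satisfying its constraint.
COMPLETE = {
    "ta_dir": extension(lambda kind, area: kind == "agent"),
    "bigapple_dir": extension(lambda kind, area: area == "212"),
}

def minimal_cover(c, sources):
    """Smallest set of sources whose constraints jointly cover c,
    i.e. C implies the disjunction of the chosen C_E's.
    Trying sizes in increasing order guarantees no proper subset works."""
    names = sorted(sources)
    for k in range(1, len(names) + 1):
        for combo in combinations(names, k):
            covered = frozenset().union(*(sources[n] for n in combo))
            if c <= covered:
                return set(combo)
    return None

def split(c, sources):
    """C' (answerable from complete sources) and C'' = C minus C'."""
    everything = frozenset().union(*sources.values())
    return c & everything, c - everything

agent_in_212 = extension(lambda kind, area: kind == "agent" and area == "212")
any_agent = extension(lambda kind, area: kind == "agent")
anything = frozenset(DOMAIN)

cover_212 = minimal_cover(agent_in_212, COMPLETE)
cover_agent = minimal_cover(any_agent, COMPLETE)
c_prime, c_double_prime = split(anything, COMPLETE)
```

For the first query a single source suffices (either one would do; the sketch returns the first in name order), for the second only `ta_dir` covers the constraint, and for an unconstrained query the remainder `c_double_prime` is what would have to be computed from the other content specifications.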

Although the above algorithm is not a polynomial time
algorithm (even for order constraints), the complexity of the
algorithm is in the size of the representation of the query
constraints and the site description constraints, not in the
size or number of the site relations.
Dynamic Query Plans 

In traditional database systems, the plan execution comes 
strictly after the query is optimized and the complete plan for 
evaluating the query is generated. Although such a static 
query plan is adequate for traditional database system 
applications, global information systems require dynamic 
plans, where the query plan generation phase interacts with 
the plan execution phase. The following example illustrates 
the benefits of postponing generating plans for sub-queries 
until run-time, when values are known for some of the query 
variables. 

EXAMPLE 5.16

Consider the airline flight application. The following
query retrieves the telephone numbers of travel agents in
Manhattan, New York:

query(AC, TelNo) :- areaCode('Manhattan, NY', AC),
travelAgent(Ag), dir(Ag, AC, TelNo).

The constraint travelAgent(Ag) present statically in the
query entails that directory information sources that do not
contain listings of travel agents are irrelevant to answering
the query. However, in the absence of knowledge about
tuples in the world-view relation areaCode (which are
computed only at run-time), the query plan would have to
treat all other directory information sources (e.g., the one for
the 908 area code) as relevant to the query.

However, once the sub-query areaCode('Manhattan, NY',
AC) is evaluated, the bindings for AC (in this case just 212)
can be used to restrict the set of relevant directory
information sources to only those with area code 212. □
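The difference between the static and the run-time view of Example 5.16 can be sketched in a few lines. The directory names, the `Directory` record and the `relevant` function below are hypothetical and only illustrate how a binding for AC narrows the set of sources.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Directory:
    name: str
    lists_travel_agents: bool
    area_code: Optional[str]  # None means "not limited to one area code"

def relevant(directories, area_codes=None):
    """Directories that may answer travelAgent(Ag), dir(Ag, AC, TelNo).
    `area_codes` holds the run-time bindings for AC, or None if unknown."""
    keep = []
    for d in directories:
        if not d.lists_travel_agents:
            continue  # pruned statically by the travelAgent(Ag) constraint
        if (area_codes is not None and d.area_code is not None
                and d.area_code not in area_codes):
            continue  # pruned dynamically once AC is bound
        keep.append(d.name)
    return keep

DIRECTORIES = [
    Directory("dir_212", True, "212"),
    Directory("dir_908", True, "908"),
    Directory("hardware_212", False, "212"),
]

static_plan = relevant(DIRECTORIES)                        # before areaCode runs
dynamic_plan = relevant(DIRECTORIES, area_codes={"212"})   # after AC = 212
```

Before the areaCode sub-query is evaluated both travel-agent directories remain candidates; once AC is bound to 212 only the 212 directory does.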

To be able to perform such optimizations, it is necessary
that we pass sideways values computed for some of the
query variables to create or modify segments of the query
plan dynamically, i.e., at run-time. The following example
illustrates the optimization benefits of passing not just values
of the query variables, but also additional information
obtained at run-time.

EXAMPLE 5.17 

Suppose that unitedAgent and americanAgent were
disjoint subconcepts of the concept travelAgent, i.e., no travel
agent is both an agent for United Airlines and for American
Airlines. Assume that the United Airlines information source
provides a directory service for United Airlines agents
ua_dir(Ag, AC, TelNo), and American Airlines provides a
directory service for American Airlines agents aa_dir(Ag,
AC, TelNo). The content specifications 205 for these site
relations are as follows:

ua_dir(Ag, AC, TelNo) ⊆ unitedAgent(Ag), dir(Ag, AC, TelNo).
aa_dir(Ag, AC, TelNo) ⊆ americanAgent(Ag), dir(Ag, AC, TelNo).

Consider now the following query that retrieves the
telephone numbers of award-winning travel agents (a
subconcept of travelAgent).

query(AC, TelNo) :- awardTravelAgent(Ag), dir(Ag, AC, TelNo).

If a binding for awardTravelAgent(Ag) was found at a site
that only had information about United Airlines agents, this
information could be used to determine that the site relation
aa_dir is irrelevant for answering the query. This shows
that knowing the source where the binding for
Ag was found can be used to prune the directory sources
where no matching listing would be found. □

The above examples illustrate the two key features of
dynamic query plan generation:

1. Postpone planning for sub-queries until run-time, when
sufficient information is available to determine a small
set of relevant sources.

2. Pass additional information obtained at run-time, not
just values of query variables, to the query optimizer.

We have identified two additional pieces of information
that are very useful for pruning information sources, and
which can be easily determined from the site descriptions,
and passed in the binding information for query variables:
(1) the type of the value, and (2) the location where the value
was found. Details concerning the information and how to
use it in an algorithm for dynamically generating a query are
presented below.
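A binding that carries its type and the location where it was found, as in Example 5.17, might be sketched as follows. The `Binding` record, the disjointness table, the site name `united_awards` and the `prune` function are assumptions introduced for illustration; only ua_dir, aa_dir and the concept names come from the example.

```python
from dataclasses import dataclass

# Pairs of concepts declared disjoint in the domain model (Example 5.17).
DISJOINT = {frozenset({"unitedAgent", "americanAgent"})}

# Concept that each directory's content specification restricts it to.
SOURCE_CONCEPT = {"ua_dir": "unitedAgent", "aa_dir": "americanAgent"}

@dataclass(frozen=True)
class Binding:
    value: str            # the value bound to the query variable
    concepts: frozenset   # types inferred from the site description
    found_at: str         # location where the value was found

def prune(sources, binding):
    """Drop sources whose concept is disjoint from a type of the binding."""
    return [s for s in sources
            if not any(frozenset({SOURCE_CONCEPT[s], c}) in DISJOINT
                       for c in binding.concepts)]

# An award-winning agent found at a (hypothetical) United-only site.
ag = Binding("agent-17", frozenset({"unitedAgent"}), "united_awards")
remaining = prune(["ua_dir", "aa_dir"], ag)
```

Passing only the value "agent-17" would leave both directories in the plan; the extra type information is what allows aa_dir to be dropped.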

A second reason for supporting dynamic query plans in a 
global information system is that when the external infor- 
mation sources are distributed over a computer network, it 
is quite likely some external sources are unavailable when 
required. In the presence of alternative information sources 
that can provide the same information (because of redun- 
dancy in the autonomous information sources), the query 
plan must be dynamically modifiable. 

Types of Information which are Useful in Dynamic Query 
Generation 

The following discussion provides details about the
selection of information which is useful in dynamic query
generation. The discussion is based on Craig A. Knoblock and
Alon Levy, "Efficient Query Processing for Information
Gathering Agents", to appear in working notes of the 1995
AAAI Spring Symposium on Information Gathering in
Distributed and Heterogeneous Environments, available
from AAAI. In the following, C_i, C_j, etc. denote classes in
domain model 111. Binary relations among objects in
domain model 111 are represented by roles (denoted by r_i,
etc.). The discussion also employs a running example in
which system 101 has received a query concerning the
publications of Ron Brachman, who is a researcher in
artificial intelligence at AT&T Bell Laboratories.

An information source 123 s can be viewed as providing
some knowledge about a class C_s in the domain model. It
can either provide some or all of the instances of the class
C_s. In the latter case we will say that s is a complete source.
The source s also provides some role fillers for the instances
it knows about. Formally, s provides the role fillers for the
roles r_1^s, ..., r_n^s. For each role, s may provide all the fillers
or only some of them. The information about which class
and roles s knows about is contained in information source
description 113 for s.

We can now describe the kinds of information that can be
obtained by system 101 at run time and how they can be
used. The first set of information types (called domain
information) includes information about the class hierarchy
and individuals in those classes. Specifically, we have
identified the following types of information:
Membership An individual being a member (or not a
member) of a class, for example, Ron Brachman being an
instance of AI-researcher.
Fillers One or more individuals filling a role of another
individual (or not being a filler of a role), for example, that
the affiliation of Ron Brachman is AT&T Bell Labs.
Size The size of a class or the number of fillers of a role.
Constraints High level constraints on classes or fillers of
roles (e.g., all fillers are in a certain range).
Relationships Relationships between different classes or
roles (e.g., one class contains another). 3

3 Note that intensional subsumption relationships between
classes can be inferred in the domain model. This class of
information refers to extensional containment relationships,
e.g., that all instances of C_i are also instances of C_j.

The second set of information types (called source
information) is like the above types, but concerns knowledge
about information sources, and not about the domain
model's class hierarchy:
Membership An individual being found in an information
source (or not being found there).
Fillers One or more individuals filling a role of another
individual in a specific information source.
Size The number of class instances found in a specific
information source.
Constraints High level constraints specific to an information
source (e.g., an information source only contains Bell Labs
researchers).
Relationships Relationships between different classes or
roles (e.g., source s_1 containing all the data in source s_2).

It should be noted that in some cases the domain
information can be inferred from the source information, and the
description of the sources.

Using the Information to Optimize Queries 
There are several ways in which the information types
outlined above can be used to optimize queries:
Membership Membership information can be useful in
identifying an information source that is likely to contain
additional information. If we found the individual a in
source s, and a subsequent subgoal asks for the filler of a role
r of a, we will first check whether s contains fillers for r
(which will be known in the description). Note that this type
of information is especially useful because typically
information sources will only have part of the instances of a
class, and therefore, finding an instance in a given information
source is a significant piece of information.
Fillers Information about specific fillers for roles can be used
to constrain the queries to other information sources. For
example, if we learn the area code for Bob Jones from one
information source, then it can be incorporated into the
query sent to another information source.
Size Size information about classes and intermediate results
is useful in ordering subgoals in a query. Traditional query
processing systems estimate sizes before processing starts,
but using actual size information may be critical when good
estimates are unavailable.
Relationships The main use of additional domain model
information is to rule out possible information sources.
Knowing that an individual belongs to a more specific class
than can be inferred from the query enables us to limit the
number of sources considered in later subgoals of the query
that contain the individual as a binding. For example,
knowing that Ron Brachman is an AI researcher enables us
to focus on paper repositories that provide AI publications.
Knowing that he is an AT&T employee provides a
justification for considering first a paper repository from AT&T
researchers.
Constraints Domain-level constraints can be used by
propagating the restrictions from one subgoal to the next. This is
similar to some of the reformulations done with semantic
query optimization, except that the constraints are identified
dynamically instead of using precompiled information.

Completeness Completeness information about a class (or
the fillers of a role) enables us to stop searching for more
instances of the class (or fillers of that role).
Obtaining Domain and Source Information
A second dimension along which dynamic query processing
methods differ is the way that the domain and source
information are obtained:

Information can be found by simply solving subgoals in
the query. Instead of recording only the values of the
bindings that are found in solving a subgoal, we can
also record the information sources in which they are
found. Additional domain knowledge can be inferred
from the description of the information source in which
the binding was found. For example, if Ron Brachman
was found in the AAAI-fellow information source, then
we can infer that he is a member of the class
AAAI-fellows, which is a subclass of AI-researcher. If
Brachman was not found in an information source that
contains all physics researchers, then we can infer that
he is not a physicist. Details of this technique are
presented below.

Information about a binding can be found in the process
of trying to solve the subgoal that needs the information.
For example, we may begin considering a few paper
repositories to find Brachman's papers, and by doing so
figure out that he is a member of the AI-researcher class.
This will enable us to prune the subsequent paper
repositories we consider.

Information gained in solving previous queries can be
used. The challenge here is to remember from previous
queries only information that may be relevant in future
queries, and will not change rapidly.

Finally, the information agent can create new
subqueries in order to actively seek information about
bindings. For example, by considering the descriptions of
information sources providing paper repositories, the
agent can determine that knowing the affiliation and
field of an author dramatically reduces the number of
relevant information sources. Therefore, the agent may
first pose a query looking for Brachman's field and
affiliation, before solving the query.

Algorithm for Dynamically Generating a Query Plan: FIG. 4
In overview, the algorithm shown in FIG. 4 works by
using type information received from information source
123 to prune the sub-plans used to compute the tuples for the
rest of the query. In detail, algorithm 401 for dynamically
generating a query plan first determines a join order using
traditional techniques. Then, algorithm 401 operates in two
phases when evaluating each conjunct in the query. In the
first phase 405, algorithm 401 uses the known bindings for
the query variables to generate a sub-plan for evaluating the
conjunct. In the second phase 407, algorithm 401 accesses
the relevant information sources and generates new bindings
for the query variables using type information received from
the relevant information sources. The type information
appears in algorithm 401 as C_R^SD 409, that is, a constraint
on the external site relation. In other embodiments,
information other than type binding information may be used.
Algorithm 401 alternates between phase 405 and 407 until
each conjunct in the query has been evaluated, and the query
answered. Although algorithm 401 chooses a join order at
compile-time, it is straightforward to extend the algorithm to
use the binding information to decide on a join order
dynamically.

It is important to stress that all the type information 409
that algorithm 401 uses for optimizing queries at run-time is
available statically in the query and the various site
descriptions. In principle, it is possible to generate all possible
query plans at compile-time and merely choose from
amongst these plans at run-time. Practically speaking, the
large number of information sources makes this approach
quite infeasible, and our algorithm creates plans for
segments of the query at run-time.

Access Plan Generation and Execution 119 in a Preferred
Embodiment: FIG. 5
In order to implement algorithm 401, access plan
generation and execution component 119 of system 101 must be
modified as shown in FIG. 5. Component 119 has two
subcomponents: query plan generator 509 and query plan
executor 519. Query plan generator 509 responds to an
information access description 501 from KBS 109 which
contains site descriptions 201 by generating a query plan 511
which is made up of a number of subplans 512. Each subplan
512 is sent in turn to query plan executor 519. Query plan
executor 519 executes the current subplan 512 by producing
subquery protocol 525 for querying the information source
123 specified in current subplan 512. When the protocol is
executed, it returns subquery results 523 and additional
information 517 to query plan executor 519, which retains
subplan results 523 and returns additional information 517
to query plan generator 509, which then prunes the
remaining subplans 512 on the basis of the additional information.
When all of the necessary subplans have been executed, the
retained subquery results 523 go to graphical user interface
103 as query results 521.

In a presently-preferred embodiment, the additional
information is treated as a constraint which applies to subplan
result 523. That constraint is then applied to the concept for
which the subplan was retrieving instances. If query plan
511 has unexecuted subplans 512 which include that concept
and a constraint which is mutually satisfiable with the
constraint defined by the additional information, those
unexecuted subplans 512 may be pruned from query plan 511.

Improved User Interface
The following is a detailed description of the
improvements in the user interface 103 of system 101. The improved
user interface 103 is described in conjunction with FIGS. 6
through 8.

The improvements in user interface 103 are described in
connection with one embodiment of the invention in which
the information retrieval system 101 is a WWW client. Thus,
this detailed description will begin with a brief description
of hypertext navigation and interpretation of hypertext links,
which are operations common to all interactive WWW
clients.

As shown in FIG. 6, the improved user interface 103
includes a hypertext browser 602 that supports the
presentation of, and interaction with, hypermedia WWW
documents. Upon retrieval of a hypertext document by the system
101, the hypertext browser 602 formats and displays the
document as a mixture of text 604, graphics 606 and
hypertext links 608. The displayed hypertext links 608 have
a different appearance (e.g. different color, underline, italics)
to distinguish them from the rest of the text in the document.
The hypertext browser 602 allows user interaction with
these hypertext links 608 by attaching semantics to the
action of selecting a hypertext link with a graphical pointing
device, such as a mouse, and performing a gesture, such as
depressing the mouse button. Since the hypertext link 608
represents another information source, the result of selecting
a hypertext link is to retrieve the object associated with the
link. Such an object may be another hypertext document or
some other media type like sounds, images, or movies.

We use the term information source broadly to describe a
variety of entities that convey some type of information. A
particular specialized type of information source is a single
document. In the following detailed description we will
sometimes refer to documents as specific, commonly used
instances of information sources. These documents may be
hypermedia documents that include graphics, audio,
animation, and hypertext links to other information sources.
Other examples of information sources include collections
of documents (e.g. directories or databases) and information
servers that provide access to collections of other
information sources.

The hypertext link 608 displayed by the hypertext
browser 602 has an associated Universal Resource Locator
(URL) that encodes the location and access method for the
document to be retrieved. To process a retrieval operation, a
link interpreter 130 (FIG. 1) decodes the URL to determine
how to connect to information sources 123 and request the
document. The first part of the URL encodes the protocol
that is used to communicate with the server on which the
document resides. The second part of the URL is the
network name or network address of the server. The
remainder of the URL is the pathname or query that uniquely
identifies the document to the server. Having determined the
communication protocol, the link interpreter 130 passes the
server name and pathname or query that refers to the
document to the appropriate information access protocol
interface 121. Each information access protocol interface
121 implements a single network protocol for establishing
communication with the server and retrieving the document.
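The three-part decoding performed by the link interpreter can be sketched with Python's standard `urllib.parse.urlsplit`. The `INTERFACES` table, its entries and the function name `interpret_link` are hypothetical stand-ins for the information access protocol interfaces 121; the example host is a placeholder.

```python
from urllib.parse import urlsplit

# Hypothetical registry: one access protocol interface per URL scheme.
INTERFACES = {
    "http": "http_interface",
    "ftp": "ftp_interface",
    "gopher": "gopher_interface",
}

def interpret_link(url):
    """Split a URL into (protocol interface, server, pathname or query)."""
    parts = urlsplit(url)
    if parts.scheme not in INTERFACES:
        raise ValueError(f"no access protocol interface for {parts.scheme!r}")
    # The remainder of the URL identifies the document to the server.
    target = parts.path + ("?" + parts.query if parts.query else "")
    return INTERFACES[parts.scheme], parts.netloc, target

document = interpret_link("http://www.example.com/docs/index.html")
search = interpret_link("http://www.example.com/search?q=views")
```

Keeping the scheme-to-interface table separate from the parsing step reflects the description that each interface implements a single network protocol.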

Upon successfully retrieving a document, it is interpreted
and formatted for display in the hypertext browser 602.
Interpretation of the document includes identification of
embedded hypertext links, so that the hypertext browser 602
can display these links with the visual indications and
interactive behavior described above.

The above described hypertext navigation, interpretation
of hypertext links, and document retrieval based upon
hypertext links are well known and could be readily implemented
by one of ordinary skill in the art.

In one embodiment of the present invention, the user 
interface 103 is connected to the CLASSIC knowledge 
representation system (knowledge base) 109 (FIG. 1). which 
is the medium for storing information source descriptions. 
The system 101 uses information source descriptions 113 to 
represent information sources. These information source 
descriptions 113 are represented by the system in terms of 
knowledge base 109 objects. An information source descrip- 
tion is composed of relevant attributes of an information 
source. The information source description can be used to 
query the knowledge base 109 and to permit access to and 
retrieval of the information it describes. Specific examples 
of the attributes included in information source descriptions 
include properties such as the type of information (e.g. 
formatted text, graphical image), the size (content length) of 
the document, the time that the information was last 
modified, and the times that the information was accessed. 
These attributes can generally be determined with no under- 
standing of what the information is about In addition, the 
information source description includes attributes that rep- 
resent the semantic content of the information, such as a 
topic attribute that indicates what the information is about 
In general, attributes relating to the semantic content of the 
information require some understanding of the content of the 
information and may not be extracted fully automatically. 
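
As a rough illustration of the attribute split described above, an information source description might be modeled as follows. This is a minimal sketch in Python, not the CLASSIC representation; the class name, field names, and example URL are assumptions chosen only for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional


@dataclass
class InformationSourceDescription:
    """Hypothetical container for one information source description."""

    # Attributes the system can determine without understanding the content.
    url: str
    content_type: str
    content_length: int
    last_modified: Optional[datetime] = None
    access_times: List[datetime] = field(default_factory=list)

    # Semantic attributes; these generally need input from the user.
    topics: List[str] = field(default_factory=list)

    def needs_user_input(self) -> bool:
        """True while no topic has been supplied for the description."""
        return not self.topics


desc = InformationSourceDescription(
    url="http://example.org/doc.html",  # placeholder address
    content_type="formatted text",
    content_length=2048,
)
pending = desc.needs_user_input()
desc.topics.append("Food")
```

The description starts out with only system-determined values and becomes complete once the user adds at least one topic.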

This latter class of attributes, which indicate the semantic
content of an information source, establishes the relationship
between information sources and concepts in the world view
115. The world view 115 comprises concepts that are
primarily meaningful to users. The most commonly used
concepts in the world view are the topics that are used to
describe aspects of the semantic content of an information
source. These topics are related to each other in a
generalization taxonomy. The user will often wish to browse or
query the knowledge base 109 in terms of these world view
115 concepts (i.e. finding a set of information sources about
a particular topic). These browsing/querying operations can
take advantage of the taxonomic organization of the topic
concept to progressively generalize or specialize an
examination of the information sources represented in the
knowledge base, and will be described in further detail below.
Attributes related to extrinsic properties of information
sources, such as network addresses and access methods,
establish the relationship between information sources and
concepts in the system/network view 117.

The CLASSIC knowledge representation system 109 has
been described in detail above, in conjunction with the
description from the parent application, and will be further
described here only insofar as it relates to the improved user
interface 103 of the present application. CLASSIC is a
description logic-based system, operating in terms of
structured, object-centered descriptions of concepts and
their instances. CLASSIC performs inferences of subsumption
and classification to automatically organize concepts
into a generalization taxonomy, as well as classifying
individual objects under all appropriate concepts. It also
provides a rule mechanism for forward-chaining deductions.
The expressiveness of CLASSIC's description logic is
designed to ensure that inferences can be done with
polynomial cost.

The CLASSIC knowledge representation system 109
includes facilities for extending the knowledge base by
adding to and refining the domain model 111. As new
information sources are discovered and new information
source descriptions are added to the knowledge base 109, the
user's view of the world may change, so the system supports
the addition of new concepts and relationships by providing
a concept editor 708 (FIG. 7) that is invoked from the user
interface 103. The concept editor 708 is instantiated in the
lower right portion of the display screen as shown in FIG. 7.
This area of the display screen is called the command
window 622. The command window 622 is where a user
enters textual commands that cannot be expressed as pointer
gestures on display objects. In addition, many of the pointer
gestures on display objects translate directly to commands,
so the command window 622 also displays those commands
that result from performing mouse pointer operations. The
command window 622 also serves as an interaction history,
since it maintains a record of all previously executed
commands.

The concept editor 708 provides a form interface for
creating new CLASSIC concept descriptions. The fields in
the form include the name of the concept, the type of concept
(one of primitive, derived, or disjoint-primitive), the parent
concept(s), and any additional role restrictions. Editing
operations on these fields do not affect the contents of the
knowledge base 109. The knowledge base 109 is changed
only when the user confirms creation of the concept with an
explicit commit operation, at which time the concept is
created and classified. Aborting the concept editor leaves the
knowledge base unchanged. When new concepts are
created, CLASSIC's classification inferences correctly
determine all descriptions that satisfy the membership
restrictions of the new concept.

The use of a knowledge representation system like
CLASSIC assists the user in the task of organizing the
information retrieved from various information sources. By
entering an information source description in terms of
concepts in the knowledge base 109, the system automatically
(through classification) determines where to place the
information source description in the taxonomy. Since
information source descriptions may include many attributes, this
automatic inference step is nontrivial and useful, as a given
information source description may be classified under more
than one concept.

Referring to FIG. 6, the user interface 103 includes a
hypertext browser 602 and a knowledge base browser/editor
610. The hypertext browser 602 is functionally similar to
other currently existing WWW browsers. The knowledge
base browser/editor 610 presents a graphical view of the
world view 115 portion of the knowledge base 109 to the
user. Navigation of the information space can be done using
either the hypertext browser 602 or the knowledge base
browser/editor 610. The user interface 103 supports both
navigation paradigms by allowing the user to conveniently
switch between them as appropriate.

The knowledge base browser/editor 610 displays the
world view 115 concepts as a generalization taxonomy. The
relationships among concepts are represented as a directed
graph, in which the nodes, e.g. 612, represent concepts and
the edges, e.g. 614, represent ancestor/descendent
subsumption relationships between the concepts. One function of the
knowledge base browser/editor 610 is to provide the user
with an organized overview of the concepts in the world
view 115 of knowledge base 109. Concepts outside the
world view 115 are filtered from the display to reduce and
simplify the amount of information that the interface 103
presents to the user.

As discussed above, when a user finds interesting
information from the information sources 123, the user may want
to save information source descriptions in the knowledge
base 109 to expedite future access to the information. These
information source descriptions are added to the knowledge
base 109 by creating descriptions of them in terms of the
domain model 111. When a new information source
description is to be created, the user interface 103 provides a
knowledge base object editor 616 to guide the user in
populating the description.

The knowledge base object editor 616 that is instantiated
when adding an information source description to the
knowledge base 109 presents a modifiable template of an
information source description, expressed as attribute-value
pairs. There is one of these pairs for each attribute of an
information source description, with an editable field for the
value(s) to be assigned to that attribute. The knowledge base
object editor 616 shown in FIG. 6 includes the attributes:
Name, Topics, Description, Annotation, URL (access path),
Access Time, time Last Modified, Change Frequency, and
Content Length. To minimize the effort of adding new
information source descriptions to the knowledge base 109,
the system supports this process by automatically extracting
certain attributes from the retrieved document and
populating the appropriate fields of the knowledge base object
editor 616. This process is advisory in the sense that the user
has an opportunity to modify or replace the values suggested
by the system before the object is added to the knowledge
base. In the example shown in FIG. 6, the system is able to
automatically provide fillers for all values except for the
Topics and Annotation attributes. The knowledge base
object editor 616 is used to modify the system-determined
attributes or to add other attributes that cannot be correctly
determined by the system. For example, it is the
responsibility of the user to provide fillers for the Topics attribute of
an information source. Additional assistance for automatically
creating or suggesting fillers for these user-determined
attributes is described below. When the attribute values are
satisfactory, the user concludes the editing process by
committing the creation of the new information source
description in the knowledge base, at which point a new object is
created and classified. Alternatively, the knowledge base
object editor 616 allows the process to be aborted at any
point, in which case no object is added to the knowledge
base 109. The knowledge base object editor 616 may also be
used to modify or add to existing information source
descriptions already stored in the knowledge base. In this
case, a new object will not be created when the edits are
committed, but the object may be reclassified. If the edit is
aborted, no changes are made to the object or the knowledge
base. The knowledge base object editor 616 is instantiated in
the command window 622 (discussed above).

One way in which the task of adding information source
descriptions to the knowledge base is supported is by using
the drag/drop paradigm. In this technique, a user uses a
pointing device, such as a mouse, to select, drag, and drop
an iconic representation of an object. In the user interface
103, a user can pick an iconic representation of a document
from the hypertext browser 602, drag it into the knowledge
base browser 610, and drop it on a node, e.g. 618, which
represents a topic concept. The iconic representation of a
document may be, for example, a hypertext link 620, which
is an active display element representing the document, or
some other iconic representation 624 of the document
displayed in the hypertext browser 602.

For example, as shown in FIG. 6, the user would point to
either the hypertext link 620 or other iconic representation
624, both of which represent the document currently
displayed in the hypertext browser 602. If the user dragged
either hypertext link 620 or other iconic representation 624
to the Food node 618 in the knowledge base browser/editor
610, it would indicate that the user wanted to store an
information source description of the document in the
knowledge base 109 related to the topic Food. This drag and
drop action causes the knowledge base object editor 616 to
be instantiated in the command window 622. The Topics
attribute will be populated with the Food concept as a result
of the user dragging the icon 620 or 624 to the Food node
618 in the knowledge base browser/editor 610. As discussed
above, the system-determined attributes, such as URL,
Access Time, Content Length, Last Modified, and Change
Frequency, are automatically populated by the system.

In the case where the user wishes to associate only a
single topic with the information source description, in this
example Food, the process of adding the information source
description to the knowledge base 109 can be done quickly
with only a small number of pointer gestures (i.e. without
keyboard interaction). More sophisticated descriptions
require additional user interaction through the knowledge
base object editor 616. For example, if the user wanted to
associate the document with other topics, such as
Entertainment and Incendiary Devices, the user would edit
the Topics attribute of the information source description in
the knowledge base object editor 616 prior to committing
the information source description to the knowledge base
109.

Another way in which the system supports addition of
information source descriptions to the knowledge base 109
is by providing an automatic information extractor 132
(FIG. 1) which automatically associates the contents of a
document with concepts in the world view 115 portion of the
domain model 111. This is done by consulting a mapping of
textual regular expression patterns to world view 115
concepts. When a document is to be added to the knowledge
base 109, the automatic information extractor 132 matches
the regular expression patterns against the document text.
For patterns that match, the mapping is consulted to find the
concept(s) associated with that pattern. The concepts result- 
ing from this matching process are presented to the user as 
possible choices for the attribute to which they apply. For 
example, the patterns could be keywords that relate to the 
topical content of a document, so the matching process 
produces a list of possible fillers for the document's Topic 
attribute. This information is presented to the user through 
the knowledge base object editor 616 on an advisory basis, 
since the matching process is necessarily incomplete and the 
mapping may not necessarily be reliable due to the limited 
expressiveness of the regular expressions. The user has the 
opportunity to edit the attributes using the knowledge base 
object editor 616 prior to storing the information in the
knowledge base 109. The matching process of the automatic 
information extractor 132 is intended to assist the user in 
determining appropriate concepts for describing the 
document, but the ultimate control and responsibility for 
specifying these concepts remains with the user. 
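
The pattern-to-concept mapping described above can be sketched as follows. This is only an illustrative sketch in Python: the patterns, the mapping table, and the function name `suggest_topics` are assumptions, and the result is advisory, mirroring the way the suggestions are offered to the user for editing.

```python
import re
from typing import Dict, List

# Hypothetical mapping from regular expression patterns to world view concepts.
PATTERN_TO_CONCEPTS: Dict[str, List[str]] = {
    r"\brecipes?\b": ["Food"],
    r"\bfireworks?\b": ["Entertainment", "Incendiary Devices"],
}


def suggest_topics(text: str,
                   mapping: Dict[str, List[str]] = PATTERN_TO_CONCEPTS) -> List[str]:
    """Return candidate fillers for the Topics attribute, without duplicates."""
    suggestions: List[str] = []
    for pattern, concepts in mapping.items():
        if re.search(pattern, text, flags=re.IGNORECASE):
            for concept in concepts:
                if concept not in suggestions:
                    suggestions.append(concept)
    return suggestions


candidates = suggest_topics("A recipe for fireworks night")
```

The returned list is a set of suggestions only; a real mapping would be incomplete in the same way the text notes, so the user keeps the final say.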

The knowledge base 109 serves not only as a repository 
for data about information sources but also as a medium for 
browsing and querying. That is, retrieval and display of
documents can be initiated from the knowledge base
browser/editor 610, rather than relying solely on the
hypertext browser 602.

The query language used to query the knowledge base 
109 is essentially the same as the CLASSIC language for 
expressing concept descriptions, with some additional 
operators to express operations and restrictions that cannot 
be stated within CLASSIC's description logic. CLASSIC
allows additional restrictions by providing for test-functions
in the description. These test-functions may have arbitrary
code to establish membership within a concept description.
A query states restrictions in terms of concepts and indi- 
viduals in the knowledge base that circumscribe a collection 
of documents. 

When a query is posed to the system, the query translator 
107 (FIG. 1) converts the query syntax into a CLASSIC 
concept description, which is the canonical form of the 
query used by the CLASSIC knowledge representation 
system 109 for evaluation. Query language operators that
cannot be expressed in terms of CLASSIC's description
logic are transformed into executable code that is
encapsulated in a CLASSIC test-function, which also becomes part
of the concept description. After translation of the query to
a CLASSIC expression, this canonical form is parsed and
normalized to form an unnamed temporary concept. The
final step in evaluating the query is to request the instances
(CLASSIC individuals) of this temporary concept. This list
of instances is formatted and displayed to the user as the
query result.

One mode of retrieval from the knowledge base 109 is
browsing, which is a special case of querying that
encapsulates a common knowledge base query in a single
command that is invoked using a pointer gesture in the
knowledge base browser/editor 610. For example, referring to FIG.
7, clicking a mouse button on node 704, which represents the
"Information Retrieval" concept in the knowledge base 109,
implies a query to find information source descriptions in the
knowledge base 109 that have at least one topic that
classifies under the "Information Retrieval" concept (i.e. a topic
that is a direct instance of this concept or one of its
descendants). The result of such a browsing operation is to
display a list 702, in the knowledge base browser/editor 610,
of knowledge base objects representing the information
source descriptions that satisfy the query. The displayed list
702 of knowledge base objects in the knowledge base
browser/editor 610 is interactive in the sense that the user
can perform a single mouse gesture on one of these objects
to retrieve the actual document associated with the
information source description represented by the pointed-to
object, using access path information associated with the
object. Thus, documents associated with the list of displayed
knowledge base objects 702 may be retrieved and displayed
in a manner similar to that described above in connection
with hypertext links in the hypertext browser 602.
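
The query implied by clicking a concept node can be sketched as a walk over the generalization taxonomy. The following Python sketch is a simplification, not CLASSIC's classifier: it assumes each topic is named by a concept, and the taxonomy, the document names, and the function names are hypothetical.

```python
from typing import Dict, List

# Hypothetical taxonomy: each concept maps to its direct parent concepts.
PARENTS: Dict[str, List[str]] = {
    "Computer Science": [],
    "Information Retrieval": ["Computer Science"],
    "Hypertext": ["Information Retrieval"],
    "Food": [],
}


def is_under(concept: str, ancestor: str, parents: Dict[str, List[str]]) -> bool:
    """True if concept equals ancestor or is one of its descendants."""
    if concept == ancestor:
        return True
    return any(is_under(p, ancestor, parents) for p in parents.get(concept, []))


def browse(node: str, descriptions: Dict[str, List[str]],
           parents: Dict[str, List[str]] = PARENTS) -> List[str]:
    """List descriptions having at least one topic classified under node."""
    return [name for name, topics in descriptions.items()
            if any(is_under(topic, node, parents) for topic in topics)]


DESCRIPTIONS = {
    "doc1": ["Hypertext"],
    "doc2": ["Food"],
    "doc3": ["Information Retrieval", "Food"],
}
result = browse("Information Retrieval", DESCRIPTIONS)
```

Clicking a more general node such as "Computer Science" would return the same two documents here, which is how browsing progressively generalizes or specializes the result list.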

For queries that cannot be expressed in terms of the
above-described graphical browsing operations, the user has access
to the full query language for describing more complex
restrictions on collections of documents. An example of
such a query, paraphrased in English, might be "find
documents with at least one topic under science that have not
been accessed since January 1". The user enters these
queries in the textual command window 622, discussed
above, of the user interface 103. The result of such a query
is a list of objects representing documents. As with the
browsing mode of querying, the query result is presented to
the user as an interactive list of knowledge base objects, so
that individual documents in the collection can be retrieved
by a pointer gesture on their displayed representations.

By using the CLASSIC description language as the
canonical form of a query, the system enables the user to
organize and save queries in the knowledge base 109 for
later reuse. This gives the user a convenient way to execute
idiomatic or frequently stated queries. The query is saved by
converting the intermediate form of the query, an unnamed
temporary concept, into a named concept. Creating a named
concept makes the query a permanent part of the knowledge
base 109. As with any other concept, these query concepts
are classified into an appropriate position in the
generalization taxonomy, so the knowledge base 109 assists not only
in storing the queries but also in organizing them (i.e. the
knowledge base can recognize that one query is a
generalization of another). These queries may be displayed in the
knowledge base browser/editor 610 to visually show the
relationships between them. Since the query is concisely
represented as a named object in the knowledge base 109,
subsequent execution of the query can be expressed with a
single browsing operation as described above in connection
with knowledge base browsing.

Some of the interactions between the hypertext browser
602 and the knowledge base 109 occur implicitly as a side
effect of another operation, such as hypertext browsing. The
system keeps track of hypertext browsing operations that
might affect data stored in the knowledge base 109. Such
interactions are transparent to the user, as opposed to explicit
interactions initiated by the user, such as adding a document
to the knowledge base. An example of such an implicit
interaction is based on the access time of a document. If,
while browsing the WWW, the user encounters a document
for which an information source description has previously
been stored in the knowledge base, the system will note this
by automatically updating the Access Time attribute of the
document's information source description in the
knowledge base. Other information source description attributes
which may be implicitly updated in this manner include
Content Length and Last Modified.
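
A minimal sketch of this implicit update, in Python, is given below. It assumes the stored descriptions are kept in a dictionary keyed by URL; the function name `note_visit`, the attribute keys, and the example URLs are hypothetical.

```python
from datetime import datetime
from typing import Dict, Optional


def note_visit(kb: Dict[str, dict], url: str, now: datetime,
               content_length: Optional[int] = None,
               last_modified: Optional[datetime] = None) -> bool:
    """Refresh a stored description as a side effect of browsing.

    Returns False, leaving the knowledge base untouched, when the document
    has no stored description; adding one remains an explicit user action.
    """
    description = kb.get(url)
    if description is None:
        return False
    description["access_time"] = now
    if content_length is not None:
        description["content_length"] = content_length
    if last_modified is not None:
        description["last_modified"] = last_modified
    return True


kb = {"http://example.org/a": {"access_time": datetime(1995, 1, 1),
                               "content_length": 100}}
updated = note_visit(kb, "http://example.org/a", datetime(1995, 6, 1),
                     content_length=120)
ignored = note_visit(kb, "http://example.org/b", datetime(1995, 6, 1))
```
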
The user interface includes a shelf 704, which is an area
on the display which functions as a multimedia scratchpad
for storing interactive screen objects for later use. Any
pointer-sensitive object in the display (e.g. hypertext link
708 from the hypertext browser 602, concept nodes 618
(FIG. 6) from the knowledge base browser/editor 610, etc.)
can be picked up and dragged into the shelf 704, thus
creating a copy of the object. The items placed in the shelf
704 retain their original interactive behavior. For example, a
hypertext link copied to the shelf 704 can be clicked on to
retrieve a document just as it could when the same gesture
was performed on the hypertext link in the hypertext
browser 602.

The user interface 103 also includes a knowledge base 
overview browser 706, which provides a bird's-eye view of
the directed graph displayed in the knowledge base browser/ 
editor 610. This knowledge base overview browser 706 
provides the user with an alternative view of the entire 
knowledge base concept graph, which is typically too large 
to fit entirely within the visible portion of the knowledge 
base browser/editor 610. 

The user interface 103 also includes a path history
browser 800, which is shown in FIG. 8. This path history
browser 800 displays an interactive graphical history of
which information sources the user has visited during a
session. The nodes, e.g. 802, in the path history browser 800
represent information sources (e.g. documents) that the user
has visited (i.e. retrieved and displayed in the hypertext
browser 602), with the edges, e.g. 804, representing the
hypertext links between them. The user can interact with this 
history by clicking on the nodes, which returns the hypertext 
browser 602 context to the information source associated 
with that node. 

Combining Structured And Unstructured Data Sources 

The foregoing detailed description of the improved user 
interface 103 described a user interface embodied in a 
WWW browser. The information sources in the WWW are 
generally classified as unstructured data sources, in that the 
data is not organized in a structured manner. In order to find 
information on the WWW, a user browses the information
space using the hypertext browser 602. Each document 
displayed in the hypertext browser may contain pointers, or 
hypertext links, to other related documents. In this manner, 
the user navigates the WWW to find useful information. 
When useful information is found, the user may save an 
information source description in the knowledge base 109 as 
described above. 

The detailed description of the parent application, U.S.
patent application Ser. No. 08/347,016, which was
substantially included at the beginning of the application, describes
the retrieval of information from a plurality of information 
sources, which sources are generally classified as structured 
data sources, in that the data is organized in some structured 
manner (e.g. a relational database). Information is generally 
retrieved from a structured data source by means of a query 
on the database, rather than by browsing. 

Another aspect of the present invention is the integration 
of such structured and unstructured data sources as 
described below. 

There are several ways in which structured and unstruc- 
tured data sources can be combined to provide for an 
improved information retrieval system. 

The user interface 103 can use the context of the knowl- 
edge base browser/editor 610 to help formulate a query. 

The answer to a query can be a set of points to start 
browsing, or, more generally, can be presented as a
hypertext document with explanations of the answers 
and pointers for further browsing. 

A more principled combination of structured and unstruc- 
tured information sources. 

Each of these techniques is described in further detail
below.

Using Browsing Context for Query Formulation
Suppose a user is browsing the knowledge base 109 using
the knowledge base browser/editor 610, i.e., the user is at
some point in the concept hierarchy. At this point the user
may want to pose a more specific query about the instances
of that concept. The system can automatically insert a
conjunct in the query that limits the answers to instances of
the class. It can also suggest some role names for which the
user may want to specify values or ranges.

For example, suppose one is browsing the knowledge
base and is at the concept of AI-researcher. The user may be
looking for those researchers in the class whose area of
expertise is planning. Instead of posing the query
AI-researcher(x) ∧ expertise(x, planning), the user only
specifies expertise=planning, and the system fills in the first
conjunct of the query. Furthermore, the system may pop up a
menu for the user in which he can see the possible
restrictions he can pose on AI-researcher, such as affiliation,
expertise, etc.
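
The way the browsing context completes a query can be sketched as follows. This Python sketch is illustrative only: the function names, the role table, and the textual query syntax (with "AND" standing in for the conjunction symbol) are assumptions rather than the system's query language.

```python
from typing import Dict, List

# Hypothetical table of role names that can be restricted for each concept.
ROLE_NAMES: Dict[str, List[str]] = {
    "AI-researcher": ["affiliation", "expertise"],
}


def suggest_roles(concept: str) -> List[str]:
    """Role names to offer in a menu for the concept currently browsed."""
    return ROLE_NAMES.get(concept, [])


def formulate_query(current_concept: str, restrictions: Dict[str, str]) -> str:
    """Prepend the conjunct implied by the browsing context to the user's input."""
    conjuncts = [f"{current_concept}(x)"]
    conjuncts += [f"{role}(x, {value})" for role, value in restrictions.items()]
    return " AND ".join(conjuncts)


query = formulate_query("AI-researcher", {"expertise": "planning"})
menu = suggest_roles("AI-researcher")
```

The user supplies only the restriction on expertise; the first conjunct comes from the node at which the user is positioned.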

Using Query Answers to Start Browsing 
An answer to a structured query is essentially a list of
tuples satisfying the query (as in relational databases). One
or more attributes of these tuples may be a URL. This URL
may be used to begin browsing in the unstructured data
sources. For example, we may query for AI researchers
whose area of expertise is planning, and the answer may be
a set of tuples of the form (name, home-page-url). These
tuples can be presented to the user as a hypertext document,
including hypertext links, in the hypertext browser 602, and
the user can then start browsing from there.

More generally, a tuple may be described to the user in a
hypertext document. (In the examples which follow,
underlining indicates that the displayed text represents a
hypertext link.) For example, instead of displaying tuples
such as:

Bart Selman     AT&T Bell Labs              home-page
Oren Etzioni    University of Washington    home-page



we can display: 

The known AI researchers whose area of expertise is 
Planning are: 

Bart Selman whose affiliation is AT&T Bell Labs. Click 

here for his home page. 
Oren Etzioni whose affiliation is U. of Washington. Click 

here for his home page. 
Straightforward heuristics may be employed to generate 
the English phrases connecting the attributes. 
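
One such heuristic can be sketched as a fixed sentence template filled in from each tuple. The Python sketch below is only an assumption about how this might be done: the function name, the HTML markup, and the example URLs are hypothetical.

```python
from html import escape
from typing import Iterable, Tuple


def render_answer(heading: str, rows: Iterable[Tuple[str, str, str]]) -> str:
    """Render (name, affiliation, home-page URL) tuples as a hypertext document."""
    lines = [f"<p>{escape(heading)}</p>", "<ul>"]
    for name, affiliation, url in rows:
        lines.append(
            f"<li>{escape(name)} whose affiliation is {escape(affiliation)}. "
            f'Click <a href="{escape(url, quote=True)}">here</a> '
            "for his home page.</li>"
        )
    lines.append("</ul>")
    return "\n".join(lines)


page = render_answer(
    "The known AI researchers whose area of expertise is Planning are:",
    [
        ("Bart Selman", "AT&T Bell Labs", "http://example.org/selman"),
        ("Oren Etzioni", "U. of Washington", "http://example.org/etzioni"),
    ],
)
```
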
A Principled Combination 

We now describe a more general approach to answering 
queries that incorporates structured and unstructured infor- 
mation sources. We first illustrate the approach with an 
example, and then describe the general framework. 

Suppose the query is DBConference(x, y, 1995) ∧
Temperature(y, z). In words, y is the city in which the
database conference x is being held in 1995, and z is the
average temperature in the city y (ignoring the specific
month, for now).

We may have access to a structured information source
(i.e., a database) that tells us where the database conferences
are being held in 1995. For example, it may contain the
tuples:

SIGMOD    Washington D.C.    1993
SIGMOD    Minneapolis        1994
SIGMOD    San Jose           1995
VLDB      Dublin             1993
VLDB      Santiago           1994
VLDB      Zurich             1995


However, we may not have access to structured informa- 
tion sources that provide the temperatures in specific cities. 
Instead, we have access to several unstructured information 
sources, which give textual descriptions of the weather,
including the temperatures. However, these unstructured 
information sources do not have an internal structure that 
enables extraction of the temperature in a standard fashion. 
For example, we may have the unstructured sources:

California weather server

Switzerland tourist information server 

San Jose city server 

Trying to solve the first subgoal of our query will yield the 
two facts: 

DBConference(SIGMOD, San Jose, 1995), and

DBConference(VLDB, Zurich, 1995)
and therefore, to answer the query, we need to solve the
subgoals:

Temperature(San Jose, z), and

Temperature(Zurich, z)

At this point, we can use some background knowledge 
about the unstructured sources we have. For example, we 
can infer that the California weather server may contain, in 
an unstructured fashion, the temperature in San Jose. This is 
inferred because San Jose is in California and the concept of 
weather is very closely related to the concept of temperature. 
Similarly, we can infer that the Switzerland tourist
information server will have weather information about Zurich, also
in an unstructured fashion, because tourist information
usually includes weather. Therefore, the system can display the
following to the user:

The SIGMOD conference will be held in San Jose, Calif.,
in 1995, and the weather in San Jose can be found by
clicking here (California weather server) or here (San
Jose city server).

The VLDB conference will be held in Zurich, Switzerland,
in 1995, and the weather in Zurich can be found by clicking
here (Switzerland tourist information server).

This example illustrates two things. First, the final answer 
to the query is not given by the system itself, but rather by 
the user browsing some relevant unstructured information 
sources. However, the system's query processor uses the 
structured sources as much as possible to prune which
unstructured sources will be browsed in order to complete 
the answer to the query. 

In general, the framework can be described as follows. 
Suppose our query is of the form:

Q_1(X_1) ∧ ... ∧ Q_n(X_n)

where the X_i's are tuples of variables, and the Q_i's are
predicate names. For simplicity, assume that all conjuncts in
the query except for the last one can be answered by
structured sources.

Let X_{n-1} be the set of variables that appear in one of the
first n-1 conjuncts (i.e. X_1 ∪ ... ∪ X_{n-1}).

We first solve the first n-1 conjuncts of the query, that is,
we obtain tuples of X_{n-1} that satisfy the query (in our
example, these variables were x, the conference name, and
y, the city in which it is held). For each tuple t we then
consider the last conjunct of the query. Some of the variables
in X_{n-1} appear in that conjunct; therefore, for each tuple
obtained for X_{n-1}, we obtain a partial instantiation of the last
conjunct, which we denote by Q_n(a_t) (note, the tuple a_t
contains elements from the tuple t and the variables from X_n
that do not appear in X_{n-1}). In our example, one such
instantiation would be Temperature(San Jose, z).
The conjunct Q_n(a_t) is given as input to a module that
decides which unstructured sources are relevant to it. At the
simplest, we can take the names occurring in a_t and the name
of Q_n and feed them to an information retrieval system (e.g., San
Jose and weather in our example). Alternatively, we may
simply check whether these names match the topics by
which an unstructured source is described. A different
possibility is to use some more sophisticated reasoning about
the names occurring in the conjunct Q_n(a_t) to determine
relevant sources (as illustrated in the example).
Therefore, for each tuple t we obtain a set of sources S_t.
The answer presented to the user is the set of pairs (t, s),
where s ∈ S_t.
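
The framework can be sketched end to end on the conference example. The Python below is a minimal sketch under stated assumptions: the topic sets describing each unstructured source, the background knowledge tables `LOCATED_IN` and `RELATED`, and all function names are hypothetical, and the relevance test is one simple instance of the topic-matching approach rather than the system's actual module.

```python
from typing import Dict, List, Set, Tuple

# Structured source: (conference, city, year) tuples from the example.
CONFERENCES = [
    ("SIGMOD", "Washington D.C.", 1993), ("SIGMOD", "Minneapolis", 1994),
    ("SIGMOD", "San Jose", 1995), ("VLDB", "Dublin", 1993),
    ("VLDB", "Santiago", 1994), ("VLDB", "Zurich", 1995),
]

# Unstructured sources, each described by assumed topics.
SOURCES: Dict[str, Set[str]] = {
    "California weather server": {"California", "weather"},
    "Switzerland tourist information server": {"Switzerland", "tourism"},
    "San Jose city server": {"San Jose"},
}

# Assumed background knowledge used to judge relevance.
LOCATED_IN = {"San Jose": "California", "Zurich": "Switzerland"}
RELATED = {"Temperature": {"weather", "tourism"}}


def solve_structured(year: int) -> List[Tuple[str, str]]:
    """Answer the structured conjunct DBConference(x, y, year)."""
    return [(conf, city) for conf, city, y in CONFERENCES if y == year]


def relevant_sources(predicate: str, city: str) -> List[str]:
    """Pick sources about the city itself, or about its region and a related topic."""
    region = LOCATED_IN.get(city)
    related = RELATED.get(predicate, set())
    return [name for name, topics in SOURCES.items()
            if city in topics or (region in topics and topics & related)]


def answer(year: int) -> List[Tuple[Tuple[str, str], str]]:
    """Return the pairs (t, s): structured tuple t with each relevant source s."""
    return [(t, s) for t in solve_structured(year)
            for s in relevant_sources("Temperature", t[1])]


pairs = answer(1995)
```

The structured source narrows six tuples to two, and each of those is paired only with the sources judged relevant, which is the pruning effect described in the example.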

The foregoing Detailed Description is to be understood as
being in every respect illustrative and exemplary, but not
restrictive, and the scope of the invention disclosed herein is
not to be determined from the Detailed Description, but
rather from the claims as interpreted according to the full
breadth permitted by the patent laws. For example, while the
system of the invention is advantageously implemented
using the CLASSIC knowledge base system, the principles
of the invention are by no means restricted to that system.
The invention may be implemented in other
knowledge-based systems, as well as other types of database systems
which allow for storage of objects in a structured manner.
For example, if an object oriented database were used, the
nodes in a directed graph representation of the database
would represent classes of information and the edges would
represent relationships between those classes.
We claim: 

1. An information retrieval apparatus for retrieving
information and for managing said retrieved information, the
apparatus comprising:
a structured database; 

a document browser for displaying retrieved information; 

a database browser for displaying a visual representation 
of the structure of said database; 

means for requesting a transfer of information from said 
document browser to said database; and 

storage means responsive to said means for requesting for
storing information source descriptions in said
database, said information source descriptions including
at least an access path description and a content
description of said retrieved information.

2. The information retrieval apparatus of claim 1 wherein
said visual representation of the structure of said database is
a directed graph including nodes and edges, said nodes
representing classes and said edges representing
relationships between said classes and wherein,
said means for requesting further comprises means for
graphically representing a transfer of information from
said document browser to a particular node in said
directed graph; and
said storage means further comprises means for storing
said information source descriptions in said database
based upon said particular node.

3. The information retrieval apparatus of claim 1 further
comprising:
information retrieval means for retrieving information;

query generation means responsive to said database
browser for generating a database query; and

query execution means responsive to said query for
retrieving information source descriptions from said
database and for displaying an interactive list of said
information source descriptions in said database
browser;

wherein said information retrieval means is responsive to
said interactive list of information source descriptions
for retrieving information.

4. The information retrieval apparatus of claim 3 further
comprising:

a textual query editor for modifying the query generated
by said query generation means.

5. The information retrieval apparatus of claim 1 wherein
said information source descriptions further include
information access attributes, said apparatus further comprising:

information retrieval means for retrieving information;
and

attribute update means responsive to said document
browser for updating said information access attributes
in the database when information is retrieved by said
information retrieval means.

6. The information retrieval apparatus of claim 1 wherein
said document browser is a hypertext browser.

7. The information retrieval apparatus of claim 1 wherein 
said database is a knowledge base. 

8. A user interface for an information retrieval system for
managing information retrieved from a plurality of
information sources, said information retrieval system including
storage means for storing information source descriptions in
a structured database, said user interface comprising:

a hypertext browser for displaying a retrieved document
and an iconic representation of said document on a
computer display screen;

a database browser for displaying a visual representation
of said database on said computer display screen; and

graphical pointing means for graphically representing a
transfer of said iconic representation of said document
from said hypertext browser to said visual representation
of said database in said database browser;

wherein said storage means is responsive to said graphical
pointing means for storing an information source
description as an object in said database.

9. The user interface apparatus of claim 8 further com- 
prising: 

an object editor for textually editing said information
source description object prior to storing it in said
database.

10. The user interface apparatus of claim 9 further 
wherein said information source description object com- 
prises attributes, the apparatus further comprising: 

an automatic information extractor for automatically
extracting information source description attributes
from said retrieved document and for populating the
object editor with said attributes.

11. The user interface apparatus of claim 8 wherein said
database is a knowledge base including concepts relating to
the information in said information sources, and wherein
said visual representation displayed by said database
browser is a directed graph with the nodes representing
concepts and the edges representing relationships between
said concepts, wherein:

said graphical pointing means further comprises means
for graphically representing a transfer of said iconic
representation of said document from said hypertext
browser to a particular node in said directed graph;
wherein said storage means is responsive to said graphical
representation of a transfer of said iconic representation
to a particular node, for storing an information source
description related to the concept represented by said
particular node.

12. The user interface apparatus of claim 8 wherein said 
iconic representation is a hypertext link. 

13. The user interface of claim 12 further comprising: 

a scratchpad area for storing copies of original interactive 
screen objects, wherein said copies retain the interac- 
tive properties of the original objects. 

14. The user interface of claim 8 further comprising: 
query generation means responsive to said graphical
pointing means for generating a database query in
response to a user pointing to a portion of said visual
representation of said database using said graphical
pointing means;

query execution means for executing said generated query 
and for displaying query results on said computer 
display screen as an interactive list of information 
source descriptions;

wherein said information retrieval system is responsive to 
a user pointing to one of said information source 
descriptions displayed in said interactive list for retriev- 
ing the information relating to said information source 
description and for displaying said retrieved informa- 
tion in said hypertext browser. 

15. An information retrieval apparatus for satisfying a 
request for information by retrieving information from a set 
of unstructured data sources and a set of structured data 
sources, the apparatus comprising: 

query execution means including
query plan generating means responsive to a first query
for generating a query plan and
query plan execution means responsive to the query
plan for retrieving query result information from at
least one structured data source from said set of
structured data sources;
pruning means for identifying a subset of said unstruc- 
tured data sources using said query result information; 
and 

a text browser responsive to said pruning means for 
browsing said subset of unstructured data sources and 
for retrieving information responsive to said first query. 

16. A method of organizing retrieved information in an 
information retrieval system, said method comprising the 
steps: 

displaying a retrieved document and an iconic represen- 
tation of said document in a text browser on a computer 
display screen; 

displaying a graphical representation of a structured data- 
base in a database browser on said computer display 
screen; 

storing an information source description of said
document in said structured database in response to a user
request, said structured information source description
including at least an access path description and a
content description.

17. The method of claim 16 wherein said database is a
knowledge base including concepts relating to the semantic
content of the retrieved document and wherein said graphical
representation displayed by said database browser is a
directed graph with the nodes representing concepts and the
edges representing relationships between said concepts,
further comprising the steps of:
dragging said iconic representation from said text browser
to a particular node in said directed graph,
wherein said step of storing further comprises the step of
storing an information source description of said
document related to the concept represented by said
particular node.

18. The method of claim 17 further comprising the steps:
pointing to a particular node in said directed graph;
displaying in said database browser an interactive list of
the information source descriptions which are instances
of the concept represented by said particular node;
pointing to a particular information source description in
said interactive list;
retrieving a document represented by said particular
information source description; and
displaying said document in said text browser.

19. An information retrieval method for satisfying a
request for information using a set of unstructured data
sources and a set of structured data sources, the method
comprising the steps:

generating a first query;
executing said first query and retrieving query result
information from a structured data source;
pruning said set of unstructured data sources using said
query result information to identify a subset of said
unstructured data sources;

browsing said subset of said unstructured data sources
with a text browser to retrieve information responsive
to said first query.
20. Apparatus for adding information retrieved from a 
communications network to a body of information having an 
organization, the apparatus comprising: 

a display of a representation of the retrieved information; 

a display of a non-textual representation of the organiza- 
tion; 

interactive means for moving the representation of the
retrieved information to a portion of the non-textual
representation; and

means responsive to the interactive means for incorpo- 
rating an information source description of the 
retrieved information into the body of information as 
specified by the portion of the non-textual representation
to which the representation of the retrieved
information was moved.